Gemini 3 Pro Outpaces Competition
Google's Gemini 3 Pro has taken the lead in key benchmarks[1], signaling a shift in power in the advanced language model market. On GPQA Diamond it scored 91.9% (93.8% with Deep Think mode) versus 88.1% for GPT-5.1, while on ARC-AGI-2 it achieved 31.1% (45.1% with Deep Think) compared to 17.6% for GPT-5.1. In the HUMAINE study involving 26 thousand users, the trust rating for Gemini 3 Pro reached 69%, compared with only 16% for Gemini 2.5 Pro. In coding, the model achieved 76.2% on SWE-Bench[3], slightly trailing Anthropic's Claude Opus 4.5[4] but outperforming GPT-5.1 in the LiveCodeBench Pro ranking (Elo 2,439 versus 2,243). Gemini 3 Pro, Grok 4.1 from xAI, and Llama 4 each offer a context window of 1 million tokens, while GPT-5.2 from OpenAI reaches 400 thousand tokens, a difference that directly affects whether large code bases or documents can be analyzed without splitting the data.
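To make the context-window comparison concrete, the sketch below estimates how many pieces a body of text would have to be split into for each of the window sizes quoted above. The 4-characters-per-token ratio is an assumed rule of thumb, not a real tokenizer, and the 3-million-character code base is a hypothetical input; the function names are illustrative only.

```python
# Rough estimate of how many chunks a text needs for a given context window.
# ASSUMPTION: ~4 characters per token. Real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4

# Context window sizes (in tokens) as quoted in the article.
CONTEXT_WINDOWS = {
    "Gemini 3 Pro": 1_000_000,
    "Grok 4.1": 1_000_000,
    "Llama 4": 1_000_000,
    "GPT-5.2": 400_000,
}


def estimate_tokens(num_chars: int) -> int:
    """Approximate token count from a character count (ceiling division)."""
    return -(-num_chars // CHARS_PER_TOKEN)


def chunks_needed(num_chars: int, window: int, reserve: int = 0) -> int:
    """Number of chunks needed so each chunk fits in `window` tokens.

    `reserve` is a hypothetical token budget held back for the prompt
    and the model's answer.
    """
    usable = window - reserve
    if usable <= 0:
        raise ValueError("reserve must be smaller than the context window")
    return max(1, -(-estimate_tokens(num_chars) // usable))


# Hypothetical code base of 3 million characters (about 750k estimated tokens).
codebase_chars = 3_000_000
for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: {chunks_needed(codebase_chars, window)} chunk(s)")
```

Under these assumptions the hypothetical code base fits in a single 1-million-token request but needs two passes at 400 thousand tokens, which is the practical difference the comparison points to.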
Megafunding for xAI and Deepfake Crisis
In parallel, xAI, founded by Elon Musk[11], closed a Series E round worth $20 billion[6] at a valuation of approximately $230 billion. The funding structure includes about $7.5 billion in equity[7] and $12.5 billion in five-year GPU-backed debt, against a declared cash burn rate of $1 billion per month. Investors include Nvidia (around $2 billion), Cisco, Fidelity, Qatar Investment Authority, and Tesla, which added $2 billion on January 28[9]. The funds will support development of the next-generation Grok model, using a campus in Memphis with about 2 GW of capacity. At the same time, xAI faces a serious reputational crisis: since January 9, Grok has generated sexualized images of minors[17], detected by the UK-based Internet Watch Foundation, and the mother of one of Elon Musk's children filed a lawsuit against xAI on January 17[18]. The California Attorney General demanded that the violations cease, and Indonesian authorities blocked access to Grok. According to Copyleaks, at peak times at least one such image appeared per second. In response, from January 29 xAI restricted image generation to paid subscribers on the X platform.
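As a back-of-the-envelope reading of the xAI funding figures above, the following sketch divides the reported amounts by the declared burn rate. It assumes a constant burn, no revenue, and that the proceeds are freely available as cash; none of these assumptions comes from the source, and GPU-backed debt in particular is likely tied to hardware purchases.

```python
# Figures as reported in the article, in billions of US dollars.
ROUND_TOTAL_BN = 20.0
EQUITY_BN = 7.5
DEBT_BN = 12.5
BURN_BN_PER_MONTH = 1.0


def runway_months(cash_bn: float, burn_bn_per_month: float) -> float:
    """Months of runway at a constant burn rate (simplifying assumption)."""
    if burn_bn_per_month <= 0:
        raise ValueError("burn rate must be positive")
    return cash_bn / burn_bn_per_month


print(f"Whole round: {runway_months(ROUND_TOTAL_BN, BURN_BN_PER_MONTH):.1f} months")
print(f"Equity only: {runway_months(EQUITY_BN, BURN_BN_PER_MONTH):.1f} months")
```

On these simplified terms the full round covers roughly 20 months of spending, and the equity portion alone about 7.5 months.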
AI Agents in Cybersecurity and Business
In cybersecurity, Torq, the company behind the Torq AI SOC platform and led by Ofer Smadari, raised $140 million in a Series D round[12] led by Merlin Ventures, reaching a $1.2 billion valuation[13] and total funding of $332 million. The company serves clients including Marriott, PepsiCo, Procter & Gamble, Siemens, Uber, and Virgin Atlantic, and claims to reduce alert analysis time by 90%[14] and to handle a hundred times more threats without increasing headcount. However, research from teams at ETH Zurich, Carnegie Mellon University, and organizations collaborating with IEEE indicates that the agent ecosystem remains vulnerable: of 42,447 analyzed agent skills, 26.1% contained at least one vulnerability[20], including 13.3% allowing data leaks, 11.8% carrying a privilege escalation risk, and 5.2% showing high-severity signs of malicious intent. Analyses also revealed a remote code execution flaw in GitHub Copilot[21] (CVE-2025-53773, rated 9.6 on the CVSS scale) as well as the EchoLeak vulnerability (CVE-2025-32711) in Microsoft 365 Copilot, which enables data leakage through specially crafted emails[23].
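To put the agent-skill audit percentages above in absolute terms, the sketch below converts each reported share into an approximate number of skills. The counts are rounded reconstructions from the published percentages, not figures taken from the study, and the category labels are shortened for illustration.

```python
# Reported sample size and shares from the agent-skill analysis cited above.
TOTAL_SKILLS = 42_447
SHARES = {
    "at least one vulnerability": 0.261,
    "data leak": 0.133,
    "privilege escalation": 0.118,
    "high-severity malicious intent": 0.052,
}


def approx_count(total: int, share: float) -> int:
    """Approximate absolute count implied by a rounded percentage."""
    return round(total * share)


for label, share in SHARES.items():
    print(f"{label}: ~{approx_count(TOTAL_SKILLS, share):,} skills")
```

This works out to roughly 11,079 skills with at least one vulnerability. If the three subcategories are all subsets of that group, they must overlap to some degree, since 13.3% + 11.8% + 5.2% exceeds 26.1%.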
In January, ServiceNow began a global deployment of the Claude model[16] as the default engine in its AI Platform and Build Agent tools, covering 29 thousand employees. According to Anthropic, integrating sales tools with Claude reduces preparation time for sales calls by 95%, while in healthcare and life sciences the model handles autonomous document reviews. The goal is to shorten client deployment times by 50%. In the multimedia segment, Synthesia, led by Victor Riparbelli, raised $200 million in a Series E round[28] led by Google Ventures at a $4 billion valuation, with a client base that includes 90% of Fortune 100 companies. The company is shifting its focus from one-way video recordings to conversational, agent-driven learning experiences[29] built on interactive simulations and personalization, arguing that this format accelerates knowledge transfer and increases engagement.
Chinese Competition and New Regulations
Competitive tensions are growing in the Chinese market around open-weight models. According to Reuters, DeepSeek V4 is set to debut in mid-February 2026[24], and internal tests suggest an advantage over Claude and GPT models in programming tasks. Meanwhile, Alibaba's Qwen3 scores 92.3% on AIME25 and 74.1% on LiveCodeBench, comparable in quality to GPT-4o. Based on July 2025 data, DeepSeek holds about 4% of the global chatbot market[25], and AI-related job applications in China rose by 39% in the first three quarters of 2025. Zhipu AI and MiniMax are preparing Hong Kong IPOs, while ByteDance and Alibaba have announced upcoming versions of Doubao 2.0, Seedream 5.0, and Seedance 2.0 for February[26]. Chinese models, often released under open-weight licenses and costing 10–50 times less at over 90% quality parity, may force price adjustments in the U.S., especially given U.S. export restrictions on advanced chips.
At the regulatory level, California's SB 53 law, passed by the state legislature and signed by Governor Gavin Newsom, came into effect on January 1, 2026[32]. The regulations require publication of risk management frameworks[35] and reporting of serious safety incidents, including those involving chemical, biological, radiological, and nuclear capabilities, autonomous cyberattacks, or loss of control over an AI system. Incidents must be reported within 15 days of detection[36], or within 24 hours where there is an immediate threat to life or health. Penalties for violations can reach $1 million per offense, and reports are handled by the California Governor's Office of Emergency Services (Cal OES). Meanwhile, Texas's TRAIGA law also took effect, Colorado is preparing its own AI regulations for June 30, 2026, and the EU AI Act will impose transparency obligations from August 2, 2026.
Model Hallucinations Limit Agent Autonomy
Despite advances in inference, language models still produce high error rates. The AIMultiple report from January 2026, covering 37 models, indicates that the rate of incorrect or hallucinated responses ranges from 15% to 52%[78]. A study of Duke University students showed that 94% of respondents view model accuracy as highly variable[40], and 90% expect clearer communication of limitations. More than 100 hallucinated citations were found[60] in papers accepted at the NeurIPS 2025 conference, whose acceptance rate was 24.5%. An Apple research team noted that internal model representations store more information about the truthfulness of statements[42] than previously assumed, potentially opening paths to new fact-checking methods. Duke's analysis identified three main causes of errors: training data quality, pragmatic conditions (context, tone, nuances), and evaluation metrics that favor confidence over factual correctness. In practice, this means that in high-risk fields like medicine, law, or finance, AI agent systems must operate under mandatory human oversight, and regulators may eventually include hallucination metrics among safety requirements.
Sources
- [1] vellum.ai
- [3] evolink.ai
- [4] shakudo.io
- [6] sullcrom.com
- [7] finance.yahoo.com
- [9] cnbc.com
- [11] cnbc.com
- [12] torq.io
- [13] thesaasnews.com
- [14] bankinfosecurity.com
- [16] releasebot.io
- [17] nytimes.com
- [18] aljazeera.com
- [20] semanticscholar.org
- [21] mdpi.com
- [23] lasso.security
- [24] reuters.com
- [25] aiagentstore.ai
- [26] trendforce.com
- [28] thesaasnews.com
- [29] synthesia.io
- [32] bakerbotts.com
- [35] goodwinlaw.com
- [36] wsgrdataadvisor.com
- [40] blogs.library.duke.edu
- [42] machinelearning.apple.com
- [60] arxiv.org
- [78] research.aimultiple.com
