30 September 2025
7 mins read

China’s DeepSeek Unveils AI Model That Halves Costs – The ‘Sparse Attention’ Revolution

  • New model announced: DeepSeek on Sept. 29 released its experimental LLM DeepSeek-V3.2-Exp, which introduces a novel “DeepSeek Sparse Attention” mechanism focusing computation on key tokens [1] [2].
  • Huge cost cuts: The startup says this approach slashes API inference costs by roughly 50% for long-text tasks [3] [4]. In early tests, the price of a typical call dropped by as much as half when processing very long contexts [5] [6].
  • Open source release: V3.2-Exp is fully open-weight and available under an MIT license on developer platforms (Hugging Face/GitHub) [7] [8], enabling anyone to download or self-host it.
  • How it works: The “sparse attention” uses a “lightning indexer” plus a fine-grained token selector to pick only the most relevant parts of a huge input [9] [10]. This cuts the quadratic compute needed by standard Transformers, preserving output quality while trimming energy and latency [11] [12].
  • Performance: Reports say V3.2-Exp largely matches its predecessor (V3.1-Terminus) on key benchmarks [13], while cutting token costs from ~$0.07 to ~$0.028 per million (input cache hits) [14]. However, DeepSeek’s models still rank below top-tier AIs like GPT-5 or Anthropic’s Claude on overall “intelligence” tests [15] [16].
  • Strategic context: DeepSeek calls V3.2-Exp “an intermediate step toward our next-generation architecture” [17]. Notably, the model is built to run on Chinese AI chips (e.g. Huawei Ascend, Cambricon) “right out of the box” [18] [19], aligning with Beijing’s push for homegrown hardware amid U.S. export bans.
  • Expert views: Analysts welcome the cost savings. Futurum Group’s Nick Patience says the model “should make [AI] faster and more cost-effective … without a noticeable drop in performance” [20]. But others, like BlankPage Capital’s Ekaterina Almasque, warn that sparse methods “cut out things you think are not important” – with no guarantee the model won’t drop truly relevant data [21].

DeepSeek’s New V3.2-Exp Model

Hangzhou-based DeepSeek burst onto the AI scene earlier in 2025 with its R1 model (a heavily RL-trained chatbot) [22]. This time, DeepSeek’s announcement focuses on efficiency. On Sept. 29 the company published a post (on Hugging Face) unveiling DeepSeek-V3.2-Exp, an experimental large language model built on its V3 series [23] [24]. According to DeepSeek, V3.2-Exp maintains similar reasoning performance to V3.1 but uses far less compute for long inputs. The key innovation is a “DeepSeek Sparse Attention” (DSA) mechanism: rather than comparing every token to every other in a long document (the dense attention used by vanilla Transformers), DSA first uses a “lightning indexer” to pick out important excerpts, then a fine-grained selector to zoom in on the most salient words inside them [25] [26]. This two-stage pruning means the model can “handle a large amount of data” more cheaply, processing tens of thousands of tokens without exploding costs [27] [28].
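To make the two-stage idea concrete, here is a minimal, illustrative sketch in NumPy, not DeepSeek’s actual code or architecture: a cheap “indexer” scores every token in a long context, and full softmax attention then runs only over the top-scoring tokens. The scoring rule, the function names, and the top-k value are assumptions chosen purely for clarity.

```python
# Illustrative sketch of two-stage "sparse attention" token selection.
# This is NOT DeepSeek's implementation; the scoring and selection rules are
# simplified assumptions meant only to convey the idea of pruning the context
# before running full (dense) attention on what remains.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lightning_index_scores(query, keys):
    """Stage 1: a cheap relevance score for every context token.
    Here it is just a dot product; a real indexer would be a small
    learned module that is much cheaper than full attention."""
    return keys @ query

def sparse_attention(query, keys, values, top_k=256):
    """Stage 2: ordinary softmax attention, but only over the top_k
    tokens chosen by the indexer instead of the whole context."""
    scores = lightning_index_scores(query, keys)   # one cheap pass over n tokens
    selected = np.argsort(scores)[-top_k:]         # keep only the best top_k tokens
    k_sel, v_sel = keys[selected], values[selected]
    attn = softmax(k_sel @ query / np.sqrt(query.shape[-1]))
    return attn @ v_sel                            # weighted sum over selected tokens only

# Toy usage: a 100,000-token context, but attention only ever touches 256 tokens.
rng = np.random.default_rng(0)
d, n = 64, 100_000
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = sparse_attention(q, K, V, top_k=256)
print(out.shape)  # (64,)
```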

DeepSeek’s announcement on Hugging Face explicitly calls V3.2-Exp an “intermediate step toward our next-generation architecture” [29]. In practice, it built V3.2-Exp by adding DSA on top of its V3.1-Terminus model (itself a refinement of V3.1) [30]. The company also released the full model weights and code under an open-source license (MIT) on Hugging Face and GitHub [31] [32], continuing its commitment to transparency. As VentureBeat notes, anyone can now download, modify, or deploy V3.2-Exp without fees [33]. DeepSeek even provides optimized kernels (via LMSYS and vLLM) to run the sparse model across contexts up to 128K tokens [34].
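For readers who want to try the open weights themselves, the snippet below shows the generic Hugging Face download path. The repository ID is an assumption based on DeepSeek’s usual naming rather than something verified here, and actually serving a model of this size requires substantial hardware.

```python
# Hypothetical self-hosting sketch using the standard huggingface_hub client.
# The repo ID below is an assumption based on DeepSeek's usual naming and may
# differ; check the official Hugging Face page cited in this article.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed repository name
    local_dir="./deepseek-v3.2-exp",          # where to place the weights
)
print("Model files downloaded to:", local_dir)

# An inference server such as vLLM (mentioned above) could then serve the
# downloaded weights, e.g. `vllm serve ./deepseek-v3.2-exp`, on suitable hardware.
```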

How “Sparse Attention” Slashes Costs

Transformer models like ChatGPT normally pay a steep price for long texts. Classic self-attention scales quadratically with context length – doubling the text roughly quadruples the attention work. As a result, “longer sequences – tens of thousands or even over 100,000 tokens – cause costs to rise much faster than the token count alone would suggest” [35]. Sparse attention tackles this by effectively ignoring irrelevant content. DeepSeek describes DSA as using a lightning indexer to score chunks of the input, then loading only the most useful tokens into the attention window [36] [37]. In experiments, this “selective attention” cut the compute per token dramatically while preserving almost the same answer quality [38]. As one report explains, “by reducing the compute burden per token at large context lengths, V3.2-Exp keeps the cost curve flatter and much lower” [39]. In practice, this means tasks like summarizing a 100-page document or chatting with full history become far more affordable.
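A back-of-the-envelope model shows why the cost curve flattens: dense attention does work proportional to n² over a context of n tokens, while a “score everything, attend to the top k” scheme does roughly n·k. The top-k value and the bare operation counts below are illustrative assumptions, not DeepSeek’s published figures.

```python
# Rough cost model contrasting dense attention (O(n^2)) with a
# "score everything, attend to top_k" scheme (roughly O(n * top_k)).
# All numbers are illustrative only, not DeepSeek's published figures.
def dense_attention_ops(n):
    # every token attends to every other token
    return n * n

def sparse_attention_ops(n, top_k=2048):
    # each token attends to at most top_k indexer-selected tokens
    # (the cheap indexer pass itself is roughly linear in n and ignored here)
    return n * top_k

for n in (8_000, 32_000, 128_000):
    dense, sparse = dense_attention_ops(n), sparse_attention_ops(n)
    print(f"{n:>7} tokens: dense {dense:.2e} vs sparse {sparse:.2e} "
          f"({dense / sparse:.0f}x less attention work)")
```

At 8K tokens the gap is modest, but at 128K tokens the sparse scheme does on the order of 60x less attention work, which is exactly the regime where the article’s quoted savings apply.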

DeepSeek’s new model uses this efficiency not only in inference but also in training and fine-tuning. The company’s published paper (linked on Hugging Face) details the indexer and token-selector design [40]. In effect, DSA causes the model to “skip irrelevant data,” as Hugging Face’s Adina Yakefu (Chinese community lead) notes, boosting speed and lowering energy use [41] [42]. Internally, the firm combined these architectural changes with more advanced distillation and reinforcement-learning steps, but the headline is that V3.2-Exp can process very long contexts (up to 128K tokens) without the runaway costs a normal Transformer would incur [43] [44].

Performance and Cost Savings

Despite its radical new design, V3.2-Exp delivers nearly the same accuracy as its predecessor on standard benchmarks. VentureBeat reports that the model “mostly matches or slightly improves the benchmarks” of V3.1-Terminus [45]. In held-out tests, scores on tasks like reasoning, coding, and Q&A were essentially flat compared to V3.1 [46]. This implies DeepSeek achieved its goal: maintain performance while cutting resource use. (Notably, DeepSeek’s V3 series still trails leading AIs in raw capability; for example, V3.1 ranks behind OpenAI’s GPT-5 and Anthropic’s Claude in recent rankings [47].)

The real difference comes in price. DeepSeek publicly slashed its API pricing with V3.2-Exp. Under the new scheme, one million input tokens cost about $0.028 (for cache hits), versus $0.07 before [48] – roughly a 60% cut. (Output tokens are also cheaper.) Reuters notes that DeepSeek’s official announcement claims API prices were cut “by 50%+” [49]. In long-context applications, internal tests showed typical per-request costs falling by half or more [50] [51]. Industry comparisons now list DeepSeek’s API among the cheapest; only OpenAI’s small “GPT-5 Nano” tier (not full GPT-5) comes in lower per token [52].
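To put those per-million-token prices in concrete terms, the snippet below applies the quoted cache-hit input rates to a hypothetical 120K-token request; the request size is invented for illustration, and output-token pricing is ignored for simplicity.

```python
# Cost comparison using the per-million-token input prices quoted above
# ($0.07 before vs $0.028 after, for cache hits). The request size is a
# made-up example; output-token pricing is omitted.
OLD_PRICE_PER_M = 0.07    # USD per million input tokens (cache hit), previous pricing
NEW_PRICE_PER_M = 0.028   # USD per million input tokens (cache hit), V3.2-Exp pricing

input_tokens = 120_000    # e.g. a long document plus chat history (illustrative)

old_cost = input_tokens / 1_000_000 * OLD_PRICE_PER_M
new_cost = input_tokens / 1_000_000 * NEW_PRICE_PER_M
print(f"old: ${old_cost:.5f}  new: ${new_cost:.5f}  "
      f"savings: {1 - new_cost / old_cost:.0%}")   # -> 60%
```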

In practical terms, users can now afford to run AI workloads over far longer texts before costs spike out of control [53] [54]. For example, summarizing a 50-page report or maintaining a huge chat history is now “far more practical and affordable” [55]. DeepSeek and venture analysts highlight that this could open powerful AI to smaller developers. As Futurum Group researcher Nick Patience tells CNBC, the innovation should make the model “faster and more cost-effective to use without a noticeable drop in performance” [56], expanding access for those who couldn’t afford pricier models.

China’s AI Push and Strategic Impact

The launch of V3.2-Exp comes amid a heated tech rivalry. China is pushing its firms to break free of foreign chips in AI, and DeepSeek is aligning with this policy. Bloomberg notes the startup said it’s working “with Chinese chipmakers on the model” [57]. Indeed, DeepSeek confirmed V3.2-Exp runs natively on homegrown AI processors (such as Huawei’s Ascend and Cambricon) “right out of the box” [58]. This matters because U.S. export bans (under both the Biden and Trump administrations) have restricted sales of Nvidia’s top AI chips to China [59], forcing Chinese tech firms to rely on domestic semiconductors. By co-designing its model with local hardware, DeepSeek advances Beijing’s goal of AI self-sufficiency.

Strategically, the move also fuels a domestic price war among Chinese AI providers. DeepSeek’s dramatic price cuts (to roughly $0.03 per million input tokens) give it a competitive edge over other local models (e.g. Alibaba’s Qwen series) and even over some global offerings [60] [61]. Observers note that Chinese companies are watching DeepSeek closely: its R1 model earlier showed Chinese teams could train advanced LLMs cheaply [62], and now V3.2-Exp may teach even U.S. firms new tricks about efficiency [63]. Authorities in Europe and the U.S. have even barred government use of DeepSeek over security concerns, underscoring how seriously these models are taken [64]. DeepSeek’s founder himself seems aware of the geopolitical angle: the company’s post repeatedly frames the work as research into “more efficient transformer architectures” – a domain of intense global competition [65].

Importantly, DeepSeek is not alone in exploring sparse techniques. Even OpenAI experimented with sparse attention years ago [66]. But by shipping an open-source implementation at scale, DeepSeek ensures the community (and rivals) will test and improve on it. As one analyst puts it, “people will always go for what is cheap, reliable, and effective,” and DeepSeek seems determined to be that option [67]. Huawei Cloud quickly announced it had already “completed the adaptation” of V3.2-Exp to its services [68], signaling broad industry uptake.

Expert Perspectives and Outlook

Most experts applaud the reduced costs but urge caution. As Futurum’s Patience notes, cheaper inference “opens up powerful AI tools to developers who can’t afford more expensive models” [69]. That democratization is attractive, but the flip side is risk. BlankPage’s Ekaterina Almasque warns that sparse attention “cuts out things you think are not important,” with no guarantee the model isn’t accidentally dropping genuinely important details [70]. In other words, efficiency gains may come at a cost in nuance. Early third-party evaluations will be crucial to verify DeepSeek’s claims.

Some see V3.2-Exp as a tactical move. DeepSeek itself calls it “an intermediate step” [71]. Cryptopolitan notes the company is “playing the long game” by continuing to feed the open-source community [72]. Investors and users will watch for what comes next – perhaps a V3.3 or V4 that combines this cost-cutting with a capability boost. For now, DeepSeek-V3.2-Exp stands as a symbol of the shifting AI arms race: it shows that beyond raw power, efficiency and cost matter hugely. As one tech editor put it, even if V3.2-Exp doesn’t dethrone GPT-5, it might “teach U.S. providers some much needed tricks” for cheaper AI services [73].

Sources: DeepSeek’s own Hugging Face post and research paper [74] [75]; reporting by TechCrunch, Reuters, Bloomberg, Euronews, WSJ/Hindustan Times, and VentureBeat [76] [77] [78] [79]; expert comments from CNBC and Cryptopolitan coverage [80] [81]. These sources detail the model’s design, claimed cost-savings, and industry reactions.


References

1. techcrunch.com, 2. www.hindustantimes.com, 3. techcrunch.com, 4. www.euronews.com, 5. techcrunch.com, 6. www.reuters.com, 7. techcrunch.com, 8. venturebeat.com, 9. techcrunch.com, 10. venturebeat.com, 11. venturebeat.com, 12. venturebeat.com, 13. venturebeat.com, 14. venturebeat.com, 15. www.euronews.com, 16. venturebeat.com, 17. www.reuters.com, 18. www.bloomberg.com, 19. www.cryptopolitan.com, 20. www.cryptopolitan.com, 21. www.cryptopolitan.com, 22. techcrunch.com, 23. techcrunch.com, 24. techcrunch.com, 25. techcrunch.com, 26. www.hindustantimes.com, 27. techcrunch.com, 28. venturebeat.com, 29. www.reuters.com, 30. venturebeat.com, 31. techcrunch.com, 32. venturebeat.com, 33. venturebeat.com, 34. venturebeat.com, 35. venturebeat.com, 36. techcrunch.com, 37. venturebeat.com, 38. venturebeat.com, 39. venturebeat.com, 40. www.hindustantimes.com, 41. www.cryptopolitan.com, 42. www.cryptopolitan.com, 43. venturebeat.com, 44. venturebeat.com, 45. venturebeat.com, 46. venturebeat.com, 47. www.euronews.com, 48. venturebeat.com, 49. www.reuters.com, 50. techcrunch.com, 51. venturebeat.com, 52. venturebeat.com, 53. venturebeat.com, 54. venturebeat.com, 55. venturebeat.com, 56. www.cryptopolitan.com, 57. www.bloomberg.com, 58. www.cryptopolitan.com, 59. www.euronews.com, 60. www.hindustantimes.com, 61. venturebeat.com, 62. techcrunch.com, 63. techcrunch.com, 64. www.euronews.com, 65. www.euronews.com, 66. www.hindustantimes.com, 67. www.cryptopolitan.com, 68. www.hindustantimes.com, 69. www.cryptopolitan.com, 70. www.cryptopolitan.com, 71. www.reuters.com, 72. www.cryptopolitan.com, 73. techcrunch.com, 74. techcrunch.com, 75. www.hindustantimes.com, 76. www.reuters.com, 77. www.euronews.com, 78. venturebeat.com, 79. www.cryptopolitan.com, 80. www.cryptopolitan.com, 81. www.cryptopolitan.com
