{"id":142,"date":"2025-05-05T18:22:00","date_gmt":"2025-05-05T10:22:00","guid":{"rendered":"https:\/\/blog.liu-qi.cn\/?p=142"},"modified":"2026-04-18T21:53:51","modified_gmt":"2026-04-18T13:53:51","slug":"%e4%b8%8d%e7%9f%a5%e9%83%a8%e7%bd%b2%e5%93%aa%e4%b8%aa%e7%89%88%e6%9c%ac%ef%bc%9f%e4%b8%80%e6%96%87%e7%9c%8b%e6%87%82qwen3%e6%9c%ac%e5%9c%b0%e9%83%a8%e7%bd%b2%e7%9a%84%e9%85%8d%e7%bd%ae%e8%a6%81","status":"publish","type":"post","link":"https:\/\/en.blog.liu-qi.cn\/2025\/05\/05\/%e4%b8%8d%e7%9f%a5%e9%83%a8%e7%bd%b2%e5%93%aa%e4%b8%aa%e7%89%88%e6%9c%ac%ef%bc%9f%e4%b8%80%e6%96%87%e7%9c%8b%e6%87%82qwen3%e6%9c%ac%e5%9c%b0%e9%83%a8%e7%bd%b2%e7%9a%84%e9%85%8d%e7%bd%ae%e8%a6%81\/","title":{"rendered":"Not Sure Which Version to Deploy? A Guide to Qwen3 Local Deployment Configuration Requirements"},"content":{"rendered":"<p>In the previous article, we tested across 3 scenarios and found that the smaller Qwen3 models significantly outperformed DeepSeek models of comparable size that were distilled using R1 reasoning data.<\/p>\n<p><a href=\"https:\/\/blog.liu-qi.cn\/2025\/05\/01\/qwen3%E5%80%BC%E4%B8%8D%E5%80%BC%E5%BE%97%E6%99%AE%E9%80%9A%E7%94%A8%E6%88%B7%E6%9C%AC%E5%9C%B0%E9%83%A8%E7%BD%B2%EF%BC%9F3%E4%B8%AA%E8%90%BD%E5%9C%B0%E5%9C%BA%E6%99%AF%EF%BC%8C30%E9%81%93%E9%A2%98\/\">Is Qwen3 worth local deployment for average users? 
3 real-world scenarios, 30 questions, 300 responses, 10 models in a mixed test, scored by Doubao AI!<\/a><\/p>\n<p>Over the past few days, friends on WeChat Official Accounts and Zhihu have been asking questions like &#8216;How many billion parameters can my configuration run?&#8217; and &#8216;Can I deploy a higher-precision quantized model?&#8217; Let&#8217;s discuss this today.<\/p>\n<p>Before diving into VRAM usage, let&#8217;s first cover a few foundational concepts.<\/p>\n<h2>Model Size<\/h2>\n<p>The Qwen3 open-source series includes 8 different sizes; larger sizes require more VRAM.<\/p>\n<p>Among the 8 models, 6 are Dense models and 2 are MoE (Mixture of Experts) models. Dense models activate all parameters during inference, while MoE models use a sparse activation strategy, only activating a subset of expert parameters per forward pass, offering higher performance within a limited computational budget.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/001-3e7e69ab479c.png\" \/><\/p>\n<h2>Quantized Weights<\/h2>\n<p>Quantization is a technique that reduces the numerical precision of model weights to significantly decrease VRAM usage and storage space, and may improve inference speed. Unquantized models have very high VRAM requirements and are difficult to deploy locally.<\/p>\n<p>Ollama, the most convenient tool for local deployment, offers Qwen3 models with three weight precision options: Q4_K_M (default), Q8_0, and FP16.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/002-d1809717c3fb.png\" \/><\/p>\n<h2>KV Cache<\/h2>\n<p>KV cache is a key technical concept in the inference process of large language models. Simply put, without KV cache, generating each new token would require recomputing the keys and values for the entire sequence, so the cost of every step keeps rising as the sequence gets longer. 
With KV cache, only the new token&#8217;s query, key, and value vectors need to be computed; attention is then evaluated against the keys and values already stored in the cache, greatly reducing computational load.<\/p>\n<p>KV cache is crucial for making large language models practical but is also a major source of VRAM consumption. Its size grows linearly with context length: longer contexts mean higher VRAM usage.<\/p>\n<p>The Qwen3 models have a native context length of 32K, and sizes 4B and above can be extended to 128K. However, this context length is often impractical for consumer-grade GPUs; high-end consumer GPUs (24GB+) can typically handle only 8K-16K of context.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/003-341f53ff4c58.png\" \/><\/p>\n<p>Now that we&#8217;ve covered these three foundational concepts, let&#8217;s move on to VRAM.<\/p>\n<h2>VRAM Usage<\/h2>\n<p>When running large language models locally, VRAM usage primarily comes from three components: model weights (including parameter size and quantization), KV cache, and activations\/overhead.<\/p>\n<p>Model weights refer to the space needed to store\/load model parameters, which depends on the model&#8217;s parameter count and the numerical precision used (i.e., quantization level). For example, a model with 14 billion parameters stored in FP16 (half-precision floating point, 2 bytes per parameter) would require approximately 28GB of VRAM.<\/p>\n<p>KV cache is closely related to factors such as sequence length (context length), batch size, model dimensions (number of layers, hidden layer size), and cache precision (which doesn&#8217;t have to match model weight precision and is typically FP16). It can be estimated using the formula: VRAM_kvcache \u2248 2 \u00d7 number of layers \u00d7 hidden layer dimension \u00d7 sequence length \u00d7 batch size \u00d7 bytes per value, where the factor of 2 accounts for storing both keys and values. This estimate assumes standard multi-head attention; models that use grouped-query attention (GQA), as the Qwen3 series does, cache fewer key and value heads, so actual usage can be lower. 
We won&#8217;t elaborate further here.<\/p>\n<p>Activations and overhead refer to the intermediate computation results (activations) during inference and the VRAM used by the runtime framework itself (e.g., CUDA kernels, drivers, operating system), typically around 1-2GB.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/004-0f93992f3cec.png\" \/><\/p>\n<p>For the Qwen3 models, you can refer to the table below, which assumes an 8K context length throughout.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/005-ce182a26109b.png\" \/><\/p>\n<p>For example, deploying the Q4_K_M quantized Qwen3-32B model requires approximately 19.8GB (model weights) + about 14GB (8K context KV cache) + about 1-2GB (overhead), totaling roughly 35.3GB when overhead is taken as 1.5GB.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/006-b9b2bf10c987.png\" \/><\/p>\n<p>Of course, to preempt skepticism: in actual use, the context accumulates gradually as the conversation goes on, so the KV cache doesn&#8217;t immediately occupy the full ~14GB. Even if your available VRAM falls short of the table&#8217;s figures, the model isn&#8217;t necessarily unusable.<\/p>\n<p>However, if running even a simple first task after local deployment already maxes out VRAM like this, the model will likely struggle with complex tasks. 
It&#8217;s best to switch to a smaller model or a lower-precision quantized version.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/007-47b986a34aa2.png\" \/><\/p>\n<p>Additionally, for Mixture of Experts models, a separate note is needed: in terms of VRAM usage, they aren&#8217;t much different from Dense models\u2014when loaded into VRAM, space must still be allocated for all parameters. But since only a subset of parameters is activated and used in actual computation, inference speed is much faster than a Dense model with the same VRAM footprint. When loading, consider total parameter count; for speed, consider activated parameter count.<\/p>\n<h2>Consumer-Grade GPU Compatibility<\/h2>\n<p>Typical compatibility between consumer-grade GPUs and models can be seen in the following diagram:<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/008-472404885fe1.png\" \/><\/p>\n<p>Theoretically, for high-end GPUs with 24GB VRAM, the Q4_K_M quantized Qwen3-30B-A3B is a very good fit in terms of VRAM usage. 
If you have a 3090\/4090, test this model first.<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/009-4f32794e7d4e.png\" \/><\/p>\n<p>Taking all of these factors into account, the following diagram provides guidance on choosing local Qwen3 models based on different configurations:<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/010-57840ada5d84.png\" \/><\/p>\n<h2>Actual Tests of Qwen3 Models at Different Sizes and Quantization Levels<\/h2>\n<p>Using the same test questions as the previous article: <a href=\"https:\/\/blog.liu-qi.cn\/2025\/05\/01\/qwen3%E5%80%BC%E4%B8%8D%E5%80%BC%E5%BE%97%E6%99%AE%E9%80%9A%E7%94%A8%E6%88%B7%E6%9C%AC%E5%9C%B0%E9%83%A8%E7%BD%B2%EF%BC%9F3%E4%B8%AA%E8%90%BD%E5%9C%B0%E5%9C%BA%E6%99%AF%EF%BC%8C30%E9%81%93%E9%A2%98\/\">Is Qwen3 worth local deployment for average users? 3 real-world scenarios, 30 questions, 300 responses, 10 models in a mixed test, scored by Doubao AI!<\/a><\/p>\n<p>We&#8217;ve added Qwen3 models of different sizes and quantization levels for your reference.<\/p>\n<p>(P.S.: This is not a rigorous test; results are for reference only. 
Models listed without a precision are Q4_K_M.)<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/011-4d8b079649a8.png\" \/><\/p>\n<p>Copywriting Project:<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/012-e2d89c342f68.png\" \/><\/p>\n<p>Summarization Project:<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/013-6c13bcf5e425.png\" \/><\/p>\n<p>RuoZhiBa Project:<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/014-f6fc30f661f7.png\" \/><\/p>\n<p>Comprehensive Score:<\/p>\n<p><img decoding=\"async\" alt=\"Image\" loading=\"lazy\" src=\"https:\/\/blog.liu-qi.cn\/wp-content\/uploads\/2025\/05\/015-a6eb44fa034d.png\" \/><\/p>\n<p>Across these three projects, the full DeepSeek-V3 (via the SiliconFlow API) still achieves the highest overall score. Now that the small 0.6B model has been added, GLM-4-Flash, which had the lowest overall score in the previous test, ranks third from last; even so, the gap between the 0.6B model and the others isn&#8217;t as large as expected. For the models in between, see the results above.<\/p>\n<p>For detailed test content, see: https:\/\/ilovezhiwai.feishu.cn\/wiki\/TboCwTXPVi4vQqkHucuc7Dq4nkI<\/p>\n<p>Based on the test results for these scenarios, which I use frequently, and on the theoretical VRAM figures, my personal preference leans towards Qwen3-14B (Q8_0).<\/p>\n<h2>How to Deploy Locally<\/h2>\n<p>Regarding local deployment, I still recommend using Ollama, since it&#8217;s very quick and convenient. 
I won&#8217;t elaborate on the deployment tutorial here; for the steps, refer to this previous article on deploying DeepSeek distilled models:<\/p>\n<p><a href=\"https:\/\/blog.liu-qi.cn\/2025\/02\/03\/%E5%A6%82%E4%BD%95%E9%80%9Aollama%E6%9C%AC%E5%9C%B0%E9%83%A8%E7%BD%B2deepseek%E8%92%B8%E9%A6%8F%E6%A8%A1%E5%9E%8B%EF%BC%9F%E5%B0%8F%E7%99%BD%E7%9C%8B%E8%BF%99%E4%B8%80%E7%AF%87%E5%B0%B1%E5%A4%9F\/\">How to Deploy DeepSeek Distilled Models Locally via Ollama? This Guide is All You Need<\/a><\/p>\n<p>If you&#8217;d rather not use Open-WebUI, you can switch to CherryStudio, Chatbox, or a similar client, as long as you enter the correct internal network IP address and port.<\/p>\n<p>Holiday&#8217;s over, enjoy getting back to work~<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article explains the key configuration requirements, VRAM usage, and practical compatibility for deploying different Qwen3 models locally.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24],"tags":[21,20,10],"class_list":["post-142","post","type-post","status-publish","format-standard","hentry","category-articles","tag-ollama","tag-qwen","tag-10"],"_links":{"self":[{"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/posts\/142","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/comments?post=142"}],"version-history":[{"count":0,"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/posts\/142\/revisions"}],"wp:attachment":[{"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp
-json\/wp\/v2\/media?parent=142"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/categories?post=142"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/en.blog.liu-qi.cn\/index.php\/wp-json\/wp\/v2\/tags?post=142"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}