feihu.hf committed 39d9315 (parent 30c8f30): update README

README.md, changed at @@ -31,6 +31,59 @@ (in the section beginning "Qwen3 is the latest generation of large language models in Qwen series, offering ..."):
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768
- Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0
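The grouped-query attention layout above (16 query heads sharing 8 KV heads) halves the KV cache relative to full multi-head attention. A back-of-the-envelope sketch; the head dimension of 128 and the fp16 cache are assumptions for illustration, not figures from this model card:

```python
# Back-of-the-envelope KV-cache size: GQA (8 KV heads) vs. a hypothetical
# full multi-head attention (16 KV heads) at the full 32,768-token context.
n_layers = 28
head_dim = 128        # assumed, not from the model card
bytes_per_elem = 2    # fp16 cache, assumed
ctx = 32_768          # context length from the model card

def kv_cache_bytes(n_kv_heads: int) -> int:
    # 2x for the separate K and V tensors, cached at every layer and position
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx

print(f"MHA: {kv_cache_bytes(16) / 2**30:.1f} GiB")  # -> MHA: 7.0 GiB
print(f"GQA: {kv_cache_bytes(8) / 2**30:.1f} GiB")   # -> GQA: 3.5 GiB
```

Whatever the exact head dimension, the cache scales linearly with the KV head count, so 8 KV heads instead of 16 halves it.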

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).

## Quickstart

### llama.cpp

Check out our [llama.cpp documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html) for a more detailed usage guide.

We advise you to clone [`llama.cpp`](https://github.com/ggerganov/llama.cpp) and install it following the official guide; we track the latest version of llama.cpp. In the following demonstration, we assume you are running commands from within the `llama.cpp` repository.

```shell
./llama-cli -hf Qwen/Qwen3-1.7B:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 40960 -n 32768 --no-context-shift
```
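The sampling flags in the command above (`--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0`) chain standard truncation samplers. A minimal sketch of how top-k, top-p, and min-p filtering narrow a probability distribution; this illustrates the general technique, not llama.cpp's actual sampler implementation:

```python
def filter_probs(probs, top_k=20, top_p=0.95, min_p=0.0):
    """Return the token indices that survive top-k, then top-p, then min-p
    filtering. Illustrative only; llama.cpp's sampler chain differs in detail."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep only the k most probable tokens
    kept = order[:top_k]
    # top-p (nucleus): keep the smallest prefix whose mass reaches top_p
    total, nucleus = 0.0, []
    for i in kept:
        nucleus.append(i)
        total += probs[i]
        if total >= top_p:
            break
    # min-p: drop tokens below min_p times the top probability (0 disables it)
    floor = min_p * probs[order[0]]
    return [i for i in nucleus if probs[i] >= floor]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
print(filter_probs(probs, top_k=3, top_p=0.95))  # -> [0, 1, 2]
```

With `--min-p 0` that stage is effectively disabled, so only the top-k and nucleus cutoffs shape the distribution before temperature-scaled sampling.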

### ollama

Check out our [ollama documentation](https://qwen.readthedocs.io/en/latest/run_locally/ollama.html) for a more detailed usage guide.

You can run Qwen3 with one command:

```shell
ollama run hf.co/Qwen/Qwen3-1.7B-GGUF:Q8_0
```

## Switching Between Thinking and Non-Thinking Mode

You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. In multi-turn conversations, the model follows the most recent instruction.

Here is an example of a multi-turn conversation:

```
> Who are you /no_think

<think>

</think>

I am Qwen, a large-scale language model developed by Alibaba Cloud. [...]

> How many 'r's are in 'strawberries'? /think

<think>
Okay, let's see. The user is asking how many times the letter 'r' appears in the word "strawberries". [...]
</think>

The word strawberries contains 3 instances of the letter r. [...]
```
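When driving the model programmatically, the soft switch can be appended to the latest user message before each turn. A minimal sketch; the helper `with_thinking_mode` is hypothetical, and only the `/think` and `/no_think` tags come from this model card:

```python
def with_thinking_mode(messages, enable_thinking):
    """Append the /think or /no_think soft switch to the most recent
    user message. Hypothetical helper for illustration only."""
    tag = "/think" if enable_thinking else "/no_think"
    out = [dict(m) for m in messages]  # avoid mutating the caller's history
    for m in reversed(out):
        if m["role"] == "user":
            m["content"] = f'{m["content"]} {tag}'
            break
    return out

history = [{"role": "user", "content": "Who are you"}]
print(with_thinking_mode(history, False)[0]["content"])  # -> Who are you /no_think
```

Because the model honors the most recent instruction, tagging only the latest user turn is enough to flip modes mid-conversation.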

## Best Practices