Files changed (2)
  1. README.md +32 -0
  2. config_vllm.json +38 -0
README.md CHANGED
@@ -137,6 +137,38 @@ print(scores.tolist())
 
 ```
 
+ #### vLLM Usage
+
+ 1. Ensure you are using `vllm==0.11.0`.
+ 2. Clone [this model's repository](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2/tree/main) and overwrite its `config.json` with `config_vllm.json`.
+ 3. Start the vLLM server with the following command (replace `<path_to_the_cloned_repository>` and `<num_gpus_to_use>` with your values):
+ ```
+ vllm serve \
+     <path_to_the_cloned_repository> \
+     --trust-remote-code \
+     --runner pooling \
+     --model-impl vllm \
+     --override-pooler-config '{"pooling_type": "MEAN"}' \
+     --data-parallel-size <num_gpus_to_use> \
+     --dtype float32 \
+     --port 8000
+ ```
+
+ You can now access the model using the OpenAI SDK, for instance:
+
+ ```
+ from openai import OpenAI
+
+ # The vLLM server does not validate the API key unless started with --api-key.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ models = client.models.list()
+ model_name = models.data[0].id
+
+ response = client.embeddings.create(
+     input=['query: how much protein should a female eat'],
+     model=model_name
+ )
+ print(response.data[0].embedding)
+ ```
+
 ### **Software Integration**
 
 **Runtime Engine:** Llama Nemotron embedding NIM
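
As a quick end-to-end check of the setup above (not part of this change), the sketch below embeds one query and two passages through the server started in step 3, then ranks the passages by cosine similarity. The `passage: ` prefix and the passage texts are illustrative assumptions that mirror the `query: ` prefix used in the README example.

```
import math

from openai import OpenAI

# Assumes the vLLM server from the steps above is running on port 8000;
# vLLM does not validate the API key unless started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model_name = client.models.list().data[0].id

# The "passage: " prefix mirrors the "query: " prefix from the README example
# and is an assumption here, as are the passage texts themselves.
texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, adult women need roughly 46 grams of protein per day.",
    "passage: The capital of France is Paris.",
]
response = client.embeddings.create(input=texts, model=model_name)
# Sort by index so embeddings line up with the input order.
embeddings = [d.embedding for d in sorted(response.data, key=lambda d: d.index)]

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# The on-topic passage should score noticeably higher than the unrelated one.
for text, emb in zip(texts[1:], embeddings[1:]):
    print(f"{cosine(embeddings[0], emb):.4f}  {text}")
```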
config_vllm.json ADDED
@@ -0,0 +1,38 @@
+ {
+   "_name_or_path": "nvidia/llama-3.2-nv-embedqa-1b-v2",
+   "architectures": [
+     "LlamaModel"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 128000,
+   "eos_token_id": 128001,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 8192,
+   "max_position_embeddings": 131072,
+   "is_causal": false,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 16,
+   "num_key_value_heads": 8,
+   "pooling": "avg",
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": {
+     "factor": 32.0,
+     "high_freq_factor": 4.0,
+     "low_freq_factor": 1.0,
+     "original_max_position_embeddings": 8192,
+     "rope_type": "llama3"
+   },
+   "rope_theta": 500000.0,
+   "tie_word_embeddings": true,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.44.2",
+   "use_cache": true,
+   "vocab_size": 128256
+ }
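
Step 2 of the README instructions (clone the repository, then overwrite `config.json` with this file) can also be scripted; a minimal sketch, assuming the `huggingface_hub` package is installed and using a hypothetical destination directory:

```
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download

# Download a local working copy of the model repository (equivalent to cloning it).
repo_dir = Path(snapshot_download(
    "nvidia/llama-nemotron-embed-1b-v2",
    local_dir="llama-nemotron-embed-1b-v2",  # hypothetical destination directory
))

# Overwrite the stock config.json with the vLLM-specific config added here.
shutil.copyfile(repo_dir / "config_vllm.json", repo_dir / "config.json")
print(f"Point `vllm serve` at: {repo_dir}")
```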