---
library_name: peft
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B
tags:
- generated_from_trainer
metrics:
- accuracy
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
model-index:
- name: plateer_classifier_test
  results: []
---

# plateer_classifier_test

This model is a fine-tuned version of [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) on the [x2bee/plateer_category_data](https://huggingface.co/datasets/x2bee/plateer_category_data) dataset.
It achieves the following results on the evaluation set:
- [MLflow run](https://polar-mlflow.x2bee.com/#/experiments/27/runs/baa7269894b14f91b8a8ea3822474476)
- Loss: 0.3242
- Accuracy: 0.8997

## How to use
#### Load the base model and the Plateer classifier model
```python
import joblib
import torch
from huggingface_hub import hf_hub_download, login
from peft import PeftModel, PeftConfig
from transformers import AutoTokenizer, TextClassificationPipeline, AutoModelForSequenceClassification

# Log in to the Hugging Face Hub with a token stored in a local file.
with open('./api_key/HGF_TOKEN.txt', 'r') as hgf:
    login(token=hgf.read().strip())

repo_id = "x2bee/plateer_classifier_v0.1"
data_id = "x2bee/plateer_category_data"

# Load the PEFT config, tokenizer, and label encoder.
config = PeftConfig.from_pretrained(repo_id, subfolder="last-checkpoint")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="last-checkpoint")
label_encoder_file = hf_hub_download(repo_id=data_id, repo_type="dataset", filename="label_encoder.joblib")
label_encoder = joblib.load(label_encoder_file)

# Load the base model with a 17-class classification head.
base_model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-1.5B", num_labels=17)
base_model.resize_token_embeddings(len(tokenizer))

# Attach the LoRA adapter to the base model.
model = PeftModel.from_pretrained(base_model, repo_id, subfolder="last-checkpoint")


class TopKTextClassificationPipeline(TextClassificationPipeline):
    """Text-classification pipeline that returns the top-k classes with decoded category names."""

    def __call__(self, inputs, top_k=5, **kwargs):
        inputs = self.tokenizer(inputs, return_tensors="pt", truncation=True, padding=True, **kwargs)
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)

        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        scores, indices = torch.topk(probs, top_k, dim=-1)

        results = []
        for batch_idx in range(indices.shape[0]):
            batch_results = []
            for score, idx in zip(scores[batch_idx], indices[batch_idx]):
                # id2label entries look like "LABEL_<n>"; recover the integer id,
                # then map it back to the category name with the label encoder.
                label = int(self.model.config.id2label[idx.item()].split("_")[1])
                predicted_class = label_encoder.inverse_transform([label])[0]
                batch_results.append({
                    "label": label,
                    "label_decode": predicted_class,
                    "score": score.item(),
                })
            results.append(batch_results)

        return results


classifier_model = TopKTextClassificationPipeline(tokenizer=tokenizer, model=model)


def plateer_classifier(text, top_k=3):
    return classifier_model(text, top_k=top_k)
```
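
If you only need inference, the LoRA adapter can optionally be folded into the base model first. Below is a minimal sketch using PEFT's `merge_and_unload()`; this step is an addition for convenience, not part of the original instructions, and the merged model is then passed to the pipeline in place of `model`:

```python
# Optional: merge the LoRA weights into the base model so inference runs on a
# plain transformers model without the PEFT wrapper.
merged_model = model.merge_and_unload()
merged_model.eval()

classifier_model = TopKTextClassificationPipeline(tokenizer=tokenizer, model=merged_model)
```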

#### Run
```python
user_input = "머리띠"  # "headband"
result = plateer_classifier(user_input)[0]
print(result)
```

```
{'label': 6, 'label_decode': '뷰티/케어', 'score': 0.42996299266815186}
{'label': 15, 'label_decode': '패션/의류/잡화', 'score': 0.1485249102115631}
{'label': 8, 'label_decode': '스포츠', 'score': 0.1281907707452774}
```
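
Because the pipeline tokenizes with padding, it also accepts a list of strings, so several product names can be classified in one call. A small usage sketch building on `plateer_classifier` defined above (the extra example inputs are illustrative):

```python
# Classify several product names in one batch; each input gets its own top-k list.
batch_inputs = ["머리띠", "운동화", "샴푸"]  # headband, sneakers, shampoo (illustrative)
batch_results = plateer_classifier(batch_inputs, top_k=3)

for text, predictions in zip(batch_inputs, batch_results):
    best = predictions[0]
    print(f"{text} -> {best['label_decode']} ({best['score']:.3f})")
```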

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of the equivalent `TrainingArguments` follows the list):
- learning_rate: 0.0002
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- total_eval_batch_size: 32
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10000
- num_epochs: 1
- mixed_precision_training: Native AMP
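
For reference, here is a hedged sketch of how these values map onto `transformers.TrainingArguments`; the output directory is a placeholder and the exact training script is not part of this card:

```python
from transformers import TrainingArguments

# Approximate reconstruction of the reported hyperparameters. A per-device batch
# size of 8 on 4 GPUs with gradient_accumulation_steps=4 gives the effective
# train batch size of 128 reported above (8 * 4 * 4 = 128).
training_args = TrainingArguments(
    output_dir="./plateer_classifier_test",  # placeholder path
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_steps=10000,
    seed=42,
    fp16=True,  # Native AMP mixed precision
)
```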

### Training results

| Training Loss | Epoch  | Step   | Validation Loss | Accuracy |
|:-------------:|:------:|:------:|:---------------:|:--------:|
| 0.5023        | 0.0292 | 5000   | 0.5044          | 0.8572   |
| 0.4629        | 0.0585 | 10000  | 0.4571          | 0.8688   |
| 0.4254        | 0.0878 | 15000  | 0.4201          | 0.8770   |
| 0.4025        | 0.1171 | 20000  | 0.4016          | 0.8823   |
| 0.3635        | 0.3220 | 55000  | 0.3623          | 0.8905   |
| 0.3192        | 0.6441 | 110000 | 0.3242          | 0.8997   |

### Framework versions

- PEFT 0.13.2
- Transformers 4.46.3
- PyTorch 2.2.1
- Datasets 3.1.0
- Tokenizers 0.20.3