Liman committed
Commit 71772cd · 1 Parent(s): e960c98

Imported model from jaxmef repo

Browse files
Files changed (7)
  1. README.md +101 -3
  2. config.json +25 -0
  3. onnx/model.onnx +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +57 -0
  7. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,101 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - liamdugan/raid
+ metrics:
+ - accuracy
+ - f1
+ - roc_auc
+ base_model:
+ - intfloat/e5-small
+ - MayZhou/e5-small-lora-ai-generated-detector
+ model-index:
+ - name: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
+   results:
+   - task:
+       type: text-classification
+     dataset:
+       name: RAID-test
+       type: RAID-test
+     metrics:
+     - name: accuracy
+       type: accuracy
+       value: 0.939
+     source:
+       name: RAID Benchmark Leaderboard
+       url: https://raid-bench.xyz/leaderboard
+ pipeline_tag: text-classification
+ ---
+ 
+ # LoRA Fine-Tuned AI-generated Detector
+ 
+ > Disclaimer
+ >
+ > This ONNX model was converted from the original model available in [safetensors format](https://huggingface.co/MayZhou/e5-small-lora-ai-generated-detector). The conversion was performed to enable compatibility with frameworks and tools that consume ONNX models.
+ >
+ > Please note that this repository is not affiliated with the creators of the original model. All credit for the model’s development belongs to the original authors. To access the original model, please visit: [Original Model Link](https://huggingface.co/MayZhou/e5-small-lora-ai-generated-detector).
+ >
+ > If you have any questions about the original model, its licensing, or its usage, please refer to the source link provided above.
+ 
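+ The exact export settings used for this conversion are not documented here, but a conversion along these lines can be reproduced with Hugging Face Optimum. A minimal sketch (the output directory name is illustrative):
+ 
+ ```python
+ # Sketch: exporting the original safetensors checkpoint to ONNX with Optimum.
+ # export=True converts the PyTorch weights to an ONNX graph on the fly.
+ from optimum.onnxruntime import ORTModelForSequenceClassification
+ from transformers import AutoTokenizer
+ 
+ model_id = "MayZhou/e5-small-lora-ai-generated-detector"
+ model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ 
+ # Write model.onnx plus the tokenizer files next to it.
+ model.save_pretrained("onnx-export")
+ tokenizer.save_pretrained("onnx-export")
+ ```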
+ This is an e5-small model fine-tuned with LoRA for sequence classification. It classifies text as AI-generated or human-written with high accuracy; a usage sketch follows the label mapping below.
+ 
+ - **Label_0**: Represents **human-written** content.
+ - **Label_1**: Represents **AI-generated** content.
+ 
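+ A minimal inference sketch with `onnxruntime`. The local path is a placeholder for wherever this repository is downloaded, and the graph's input names are assumed to match the tokenizer's output keys (as they do for standard Optimum BERT exports):
+ 
+ ```python
+ # Sketch: classifying a text with the exported ONNX graph.
+ import numpy as np
+ import onnxruntime as ort
+ from transformers import AutoTokenizer
+ 
+ repo_dir = "path/to/this/repo"  # placeholder: local copy of this repository
+ tokenizer = AutoTokenizer.from_pretrained(repo_dir)
+ session = ort.InferenceSession(f"{repo_dir}/onnx/model.onnx")
+ 
+ inputs = tokenizer("Your text here.", return_tensors="np")
+ logits = session.run(None, dict(inputs))[0]
+ 
+ # Label 0 = human-written, Label 1 = AI-generated (see mapping above).
+ label = int(np.argmax(logits, axis=-1)[0])
+ print("AI-generated" if label == 1 else "human-written")
+ ```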
+ ## Model Details
+ 
+ - **Base Model**: `intfloat/e5-small`
+ - **Fine-Tuning Technique**: LoRA (Low-Rank Adaptation)
+ - **Task**: Sequence classification
+ - **Use Cases**: Text classification for AI-generated text detection.
+ - **Hyperparameters** (a reproduction sketch follows this list):
+   - Learning rate: `5e-5`
+   - Epochs: `3`
+   - LoRA rank: `8`
+   - LoRA alpha: `16`
+ 
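+ The original training script is not published, so the following PEFT sketch only illustrates a setup consistent with the hyperparameters above; in particular, `target_modules` is an assumption:
+ 
+ ```python
+ # Sketch: LoRA fine-tuning setup matching the listed hyperparameters.
+ from peft import LoraConfig, TaskType, get_peft_model
+ from transformers import AutoModelForSequenceClassification
+ 
+ base = AutoModelForSequenceClassification.from_pretrained(
+     "intfloat/e5-small", num_labels=2
+ )
+ config = LoraConfig(
+     task_type=TaskType.SEQ_CLS,
+     r=8,            # LoRA rank (from the model card)
+     lora_alpha=16,  # LoRA alpha (from the model card)
+     target_modules=["query", "value"],  # assumption: not documented in the card
+ )
+ model = get_peft_model(base, config)
+ model.print_trainable_parameters()
+ # Training (learning rate 5e-5, 3 epochs) can then run via transformers.Trainer.
+ ```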
+ ## Training Details
+ 
+ - **Dataset**:
+   - 10,000 tweets and 10,000 versions of them rewritten by GPT-4o-mini.
+   - 80,000 human-written texts from [RAID-train](https://github.com/liamdugan/raid).
+   - 128,000 AI-generated texts from [RAID-train](https://github.com/liamdugan/raid).
+ - **Hardware**: Fine-tuned on a single NVIDIA A100 GPU.
+ - **Training Time**: Approximately 2 hours.
+ - **Evaluation Metrics**:
+ 
+   | Metric   | (Raw) E5-small | Fine-tuned |
+   |----------|---------------:|-----------:|
+   | Accuracy |          65.2% |      89.0% |
+   | F1 Score |          0.653 |      0.887 |
+   | AUC      |          0.697 |      0.976 |
+ 
+ ## Collaborators
+ 
+ - **Menglin Zhou**
+ - **Jiaping Liu**
+ - **Xiaotian Zhan**
+ 
+ ## Citation
+ 
+ If you use this model, please cite the RAID dataset as follows:
+ 
+ ```
+ @inproceedings{dugan-etal-2024-raid,
+     title = "{RAID}: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors",
+     author = "Dugan, Liam and
+       Hwang, Alyssa and
+       Trhl{\'\i}k, Filip and
+       Zhu, Andrew and
+       Ludan, Josh Magnus and
+       Xu, Hainiu and
+       Ippolito, Daphne and
+       Callison-Burch, Chris",
+     booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+     month = aug,
+     year = "2024",
+     address = "Bangkok, Thailand",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2024.acl-long.674",
+     pages = "12463--12492",
+ }
+ ```
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_attn_implementation_autoset": true,
+   "_name_or_path": "MayZhou/e5-small-lora-ai-generated-detector",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.46.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:688d586cfae7583fa97656330144c99a113a972da8d0df1358e4c2220083c420
+ size 133745403
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff