Upload 7 files

Files changed (7)

README.md +242 -0
config (1).json +26 -0
model (2).safetensors +3 -0
special_tokens_map (1).json +7 -0
tokenizer (1).json +0 -0
tokenizer_config (1).json +56 -0
vocab (1).txt +0 -0

README.md ADDED

	@@ -0,0 +1,242 @@

+# Sarcasm Detection with BERT
+This repository contains a fine-tuned BERT model for detecting sarcasm in headlines and text. The model reaches 93.86% accuracy on the held-out evaluation split when separating sarcastic from non-sarcastic headlines.
+---
+## Model Details
+- **Model Name:** BERT-Base-Uncased Fine-tuned for Sarcasm Detection
+- **Model Architecture:** BERT Base (110M parameters)
+- **Task:** Binary Classification (Sarcastic vs Non-Sarcastic)
+- **Dataset:** Sarcasm Headlines Dataset
+- **Quantization:** Float16 (half-precision weights, for smaller and faster deployment)
+- **Fine-tuning Framework:** Hugging Face Transformers
+---
+## Dataset
+The model was trained on the **Sarcasm Headlines Dataset** which contains:
+- **Total Samples:** 26,709 headlines
+- **Features:**
+  - `headline`: The text content to classify
+  - `is_sarcastic`: Binary label (1 for sarcastic, 0 for non-sarcastic)
+- **Train/Test Split:** 90% training, 10% evaluation
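The split sizes implied by these figures can be worked out directly. The sketch below assumes the evaluation share is rounded to the nearest whole headline; the splitting utility actually used may round differently, so the counts could be off by one.

```python
# Split sizes implied by the README: 26,709 headlines, 90% train / 10% evaluation.
# Assumption: the evaluation count is rounded to the nearest integer.
TOTAL_HEADLINES = 26_709
EVAL_FRACTION = 0.10

eval_size = round(TOTAL_HEADLINES * EVAL_FRACTION)
train_size = TOTAL_HEADLINES - eval_size

print(f"train: {train_size}, eval: {eval_size}")
```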
+---
+## Performance Metrics
+| Epoch | Training Loss | Validation Loss | Accuracy |
+|-------|---------------|-----------------|----------|
+| 1     | 0.2048        | 0.1821          | 92.96%   |
+| 2     | 0.1138        | 0.2792          | 91.01%   |
+| 3     | 0.0586        | 0.2372          | **93.86%** |
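+
+**Final Model Performance:**
+- **Best Accuracy:** 93.86% (epoch 3, evaluation split)
+- **Final Training Loss:** 0.146 (as reported at the end of training; this differs from the epoch-3 value in the table and may be an average over the whole run)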
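The table shows that the epoch with the highest accuracy is not the epoch with the lowest validation loss, so the "best" checkpoint depends on the criterion. A minimal sketch that makes the comparison explicit, using only the numbers in the table (the README does not say which criterion was used to select the uploaded weights):

```python
# Per-epoch metrics copied from the table above:
# (epoch, training loss, validation loss, accuracy in %)
history = [
    (1, 0.2048, 0.1821, 92.96),
    (2, 0.1138, 0.2792, 91.01),
    (3, 0.0586, 0.2372, 93.86),
]

best_by_accuracy = max(history, key=lambda row: row[3])
best_by_val_loss = min(history, key=lambda row: row[2])

print("highest accuracy: epoch", best_by_accuracy[0])
print("lowest validation loss: epoch", best_by_val_loss[0])
```

Training loss falls every epoch while validation loss is lowest after epoch 1, which is the usual sign that later epochs begin to overfit even when accuracy still improves.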
+---
+## Installation
+```bash
+pip install transformers datasets evaluate scikit-learn torch
+```
+---
+## Usage
+### Quick Start
+```python
+from transformers import pipeline
+import torch
+# Load the trained model
+classifier = pipeline("text-classification",
+                     model="./sarcasm_model",
+                     tokenizer="./sarcasm_model")
+# Test examples
+test_inputs = [
+    "I'm absolutely thrilled to be stuck in traffic again.",
+    "The weather is nice and sunny today.",
+    "Oh great, another email from the boss with more tasks."
+]
+for sentence in test_inputs:
+    result = classifier(sentence)[0]
+    # config.json defines no id2label, so the pipeline returns the default names LABEL_0 / LABEL_1
+    label = "Sarcastic" if result["label"] == "LABEL_1" else "Not Sarcastic"
+    print(f"'{sentence}' → {label} (Confidence: {result['score']:.2f})")
+```
+### Manual Model Loading
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer
+model = AutoModelForSequenceClassification.from_pretrained("./sarcasm_model")
+tokenizer = AutoTokenizer.from_pretrained("./sarcasm_model")
+# Tokenize input
+text = "Oh wonderful, another Monday morning!"
+inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
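+# Inference (gradients are not needed, so disable them to save memory)
+with torch.no_grad():
+    outputs = model(**inputs)
+    # Softmax turns the two logits into class probabilities
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_class = outputs.logits.argmax(dim=-1).item()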
+label_mapping = {0: "Not Sarcastic", 1: "Sarcastic"}
+confidence = predictions[0][predicted_class].item()
+print(f"Prediction: {label_mapping[predicted_class]} (Confidence: {confidence:.2f})")
+```
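The confidence printed by the block above is the softmax probability of the predicted class. A dependency-free sketch of that step, using made-up logits rather than real model output:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (Not Sarcastic, Sarcastic); not real model output.
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
predicted_class = probs.index(max(probs))

label_mapping = {0: "Not Sarcastic", 1: "Sarcastic"}
print(label_mapping[predicted_class], round(probs[predicted_class], 2))
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits or of the probabilities selects the same class.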
+---
+## Training Configuration
+### Model Parameters
+- **Base Model:** `bert-base-uncased`
+- **Number of Labels:** 2 (binary classification)
+- **Max Sequence Length:** 128 tokens
+- **Tokenization:** WordPiece with padding and truncation
+### Training Arguments
+- **Learning Rate:** 2e-5
+- **Batch Size:** 16 (training), 32 (evaluation)
+- **Epochs:** 3
+- **Weight Decay:** 0.01
+- **Evaluation Strategy:** Every epoch
+- **Optimizer:** AdamW (default)
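These arguments determine how many optimizer steps the run takes. A rough sketch, assuming about 90% of the 26,709 headlines (24,038) are in the training split, a single device, no gradient accumulation, and that the last partial batch is kept; none of these details are stated in the README:

```python
import math

# Assumption: about 90% of the 26,709 headlines are used for training.
train_size = 24_038
batch_size = 16  # training batch size listed above
epochs = 3

# Assumption: single device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```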
+### Hardware Requirements
+- **GPU:** NVIDIA Tesla T4 (or equivalent)
+- **Memory:** ~4GB GPU memory for training
+- **Training Time:** ~18 minutes for 3 epochs
+---
+## Model Architecture
+The model uses BERT's transformer architecture with:
+- **Encoder Layers:** 12
+- **Attention Heads:** 12
+- **Hidden Size:** 768
+- **Vocabulary Size:** 30,522
+- **Classification Head:** Linear layer (768 → 2)
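The "110M parameters" figure can be reproduced from these dimensions. The sketch below counts weights and biases for the standard BERT layout; the intermediate size (3072), maximum positions (512) and token-type vocabulary (2) are taken from the config.json in this upload, and the function name is only illustrative.

```python
def bert_classifier_param_count(vocab=30_522, hidden=768, layers=12,
                                intermediate=3_072, max_positions=512,
                                type_vocab=2, num_labels=2):
    """Count parameters of a BERT encoder with a pooler and a linear classifier."""
    layer_norm = 2 * hidden  # scale + bias
    # Word, position and token-type embeddings, followed by one LayerNorm
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by one LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), followed by one LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + encoder + pooler + classifier

total = bert_classifier_param_count()
print(f"{total:,} parameters")
```

The number of attention heads does not appear in the count: the 12 heads split the 768-dimensional projections into 64-dimensional slices without adding weights.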
+---
+## File Structure
+```
+sarcasm-detection/
+├── sarcasm_model/              # Main fine-tuned model
+│   ├── config.json
+│   ├── model.safetensors
+│   ├── tokenizer_config.json
+│   ├── special_tokens_map.json
+│   ├── vocab.txt
+│   └── tokenizer.json
+├── quantized-model/            # Float16 quantized version
+│   ├── config.json
+│   ├── model.safetensors
+│   └── tokenizer files...
+├── logs/                       # Training logs
+├── sarcasm-detection.ipynb     # Training notebook
+└── README.md                   # This file
+```
+---
+## Quantization
+A Float16 (half-precision) version of the model is available for deployment:
+```python
+import torch
+from transformers import AutoModelForSequenceClassification
+# Load the Float16 model directly in half precision
+quantized_model = AutoModelForSequenceClassification.from_pretrained(
+    "./quantized-model", torch_dtype=torch.float16
+)
+```
+
+**Benefits of Quantization:**
+- **Reduced Memory Usage:** About 50% smaller than Float32 weights (2 bytes per parameter instead of 4)
+- **Faster Inference:** Improved speed on GPUs with native Float16 support
+- **Minimal Accuracy Loss:** Half precision typically has little effect on classification accuracy
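The roughly 50% reduction follows from the storage cost per parameter. A back-of-the-envelope sketch, using the approximate 110M parameter count quoted under Model Details and counting weights only (file headers and runtime activations are ignored):

```python
PARAMS = 110_000_000  # approximate parameter count quoted under Model Details
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Weight storage in megabytes (10^6 bytes) for each precision
size_mb = {dtype: PARAMS * nbytes / 1e6 for dtype, nbytes in BYTES_PER_PARAM.items()}
saving = 1 - size_mb["float16"] / size_mb["float32"]

print(size_mb, f"saving: {saving:.0%}")
```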
+---
+## Limitations
+- **Domain Specificity:** Trained primarily on headlines; may not generalize perfectly to other text types
+- **Context Dependency:** Sarcasm detection can be highly context-dependent and subjective
+- **Cultural Nuances:** May not capture sarcasm patterns from different cultural contexts
+- **Short Text Focus:** Trained on headline-length text with a 128-token limit; longer inputs are truncated
+---
+## Potential Improvements
+- **Data Augmentation:** Include more diverse sarcasm examples
+- **Ensemble Methods:** Combine multiple models for better accuracy
+- **Context Integration:** Incorporate additional context beyond the headline
+- **Multi-language Support:** Extend to other languages
+- **Real-time Processing:** Optimize for streaming applications
+---
+## Applications
+- **Social Media Monitoring:** Detect sarcastic comments and posts
+- **Content Moderation:** Identify potentially misleading sarcastic content
+- **Sentiment Analysis Enhancement:** Improve sentiment classification accuracy
+- **News Analysis:** Analyze editorial tone and bias in headlines
+- **Customer Feedback:** Better understand customer sentiment in reviews
+---
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{sarcasm_detection_bert,
+  title={BERT-based Sarcasm Detection for Headlines},
+  author={Your Name},
+  year={2025},
+  note={Fine-tuned BERT model for binary sarcasm classification}
+}
+```
+---
+## Contributing
+Contributions are welcome! Please feel free to:
+- Report bugs or issues
+- Suggest improvements
+- Add new features
+- Improve documentation
+---
+## License
+This project is licensed under the MIT License. The underlying BERT model is released by Google under the Apache 2.0 license.
+---
+## Acknowledgments
+- **Hugging Face** for the Transformers library
+- **Google Research** for the original BERT model
+- **Kaggle** for providing the Sarcasm Headlines Dataset
+- **PyTorch** for the deep learning framework

config (1).json ADDED

	@@ -0,0 +1,26 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float16",
+  "transformers_version": "4.51.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}

model (2).safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bbf7b49382497d92d46b78336d6237fe51bea9d02e127b473a0bb681f9568363
+size 249318428
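The three added lines are a Git LFS pointer, not the weights themselves: the repository stores this small text file, and the hash and size identify the real binary. A minimal sketch of reading such a pointer (the function name is illustrative, and only the three fields shown here are handled):

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:bbf7b49382497d92d46b78336d6237fe51bea9d02e127b473a0bb681f9568363
size 249318428
"""

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], pointer["size_bytes"] / 1e6, "MB")
```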

special_tokens_map (1).json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer (1).json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config (1).json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
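The special-token ids in this file determine how a single input is laid out for BERT: `[CLS]`, the WordPiece ids, `[SEP]`, then `[PAD]` up to the target length, with an attention mask marking the real tokens. A simplified sketch of that layout; it pads to a fixed length, whereas the real tokenizer pads to the longest input in the batch when `padding=True`, and the WordPiece ids used here are placeholders:

```python
# Special-token ids taken from added_tokens_decoder above.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def build_inputs(token_ids, max_length=128):
    """Wrap ids as [CLS] ... [SEP], truncate to max_length, then right-pad with [PAD]."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding

# Placeholder WordPiece ids standing in for a short headline.
ids, mask = build_inputs([2023, 2003, 1037, 3231], max_length=8)
print(ids)
print(mask)
```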

vocab (1).txt ADDED

The diff for this file is too large to render. See raw diff