---
tags:
- biology
- DNA
- genomics
---
This is the official pre-trained model introduced in [GROVER: A foundation DNA language with optimized vocabulary learns sequence context in the human genome](https://www.biorxiv.org/content/10.1101/2023.07.19.549677v2).

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")
```
Some preliminary analysis shows that re-tokenization with Byte Pair Encoding (BPE) changes significantly for sequences shorter than 50 nucleotides. For sequences longer than 50 nucleotides, you should still be careful with the sequence edges.

We advise adding 100 nucleotides at the beginning and end of every sequence to guarantee that your sequence is represented with the same tokens as in the original tokenization.
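
As a sketch of this padding advice (an illustration added here, not taken from the original card), the snippet below flanks a region of interest with 100 nucleotides on each side before tokenizing; the placeholder flanks stand in for the real genomic context around your region.

```python
# Illustrative padding sketch: the flanks below are placeholders and should be
# replaced with the 100 nt of genomic context up- and downstream of your region.
region_of_interest = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
left_flank = "A" * 100   # placeholder for the real upstream context
right_flank = "G" * 100  # placeholder for the real downstream context

padded_sequence = left_flank + region_of_interest + right_flank

# Tokenize the padded sequence; the tokens covering the region of interest
# should now match the original (whole-chromosome) tokenization more closely.
tokens = tokenizer.tokenize(padded_sequence)
print(len(tokens), tokens[:10])
```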

We also provide the tokenized chromosomes with their respective nucleotide mappers (available in the folder `tokenized chromosomes`).