---
tags:
- biology
- DNA
- genomics
---
This is the official pre-trained model introduced in [GROVER: A foundation DNA language with optimized vocabulary learns sequence context in the human genome](https://www.biorxiv.org/content/10.1101/2023.07.19.549677v2).

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")
```
Some preliminary analysis shows that re-tokenization with Byte Pair Encoding (BPE) changes significantly for sequences shorter than 50 nucleotides. For sequences longer than 50 nucleotides, you should still be careful with the sequence edges.

We advise adding 100 nucleotides at the beginning and end of every sequence to guarantee that your sequence is represented with the same tokens as in the original tokenization.
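
As a sketch of this padding advice (an illustration added here, not taken from the original card), the snippet below flanks a region of interest with 100 nucleotides on each side before tokenizing; the placeholder flanks stand in for the real genomic context around your region.

```python
# Illustrative padding sketch: the flanks below are placeholders and should be
# replaced with the 100 nt of genomic context up- and downstream of your region.
region_of_interest = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
left_flank = "A" * 100   # placeholder for the real upstream context
right_flank = "G" * 100  # placeholder for the real downstream context

padded_sequence = left_flank + region_of_interest + right_flank

# Tokenize the padded sequence; the tokens covering the region of interest
# should now match the original (whole-chromosome) tokenization more closely.
tokens = tokenizer.tokenize(padded_sequence)
print(len(tokens), tokens[:10])
```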

We also provide the tokenized chromosomes with their respective nucleotide mappers (available in the folder `tokenized chromosomes`).