Update README.md
README.md
CHANGED
@@ -7,7 +7,22 @@ license: apache-2.0
This is ["naver-clova-ix/donut-base"](https://huggingface.co/naver-clova-ix/donut-base) but with all non-ascii tokens removed. This means the model is good for basic English use cases where the text is primarily a-zA-Z0-9 and basic punctuation.
-The original model, `"naver-clova-ix/donut-base"`, did not have a token for `"1"`, so that has also been added. The notebook remove-donut-tokens.ipynb details the whole process.
-This has not been trained any more than the original model.
+The original model, `"naver-clova-ix/donut-base"`, did not have a token for `"1"`, so that has also been added. The notebook [remove-donut-tokens.ipynb](remove-donut-tokens.ipynb) details the whole process.
+This has not been trained any more than the original model.
+
+I made a whole video about it: https://youtu.be/Uzr553x1gdM
+
+
+I did a quick speed test for generation against the default model and using `bad_words_ids`. The `bad_words_ids` was only 12k tokens instead of the 30k that were removed and it was still noticeably slower.
+
+Speed script [here](speed_test.py)
+Launched with [this](run_speed_tests.sh)
+
+
+approach | time to generate 10 tokens
+- | -
+"naver-clova-ix/donut-base" | 205ms
+"naver-clova-ix/donut-base" + 12k `bad_words_ids` | 280ms
+"donut-base-ascii" | 195ms
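
For readers who have not used `bad_words_ids`, the comparison in the added README text can be sketched roughly as follows. This is not the author's `speed_test.py` or notebook, which are not shown here. It is a minimal illustration under assumptions: the helper names (`non_ascii_bad_words_ids`, `time_call_ms`) and the toy vocabulary are hypothetical, and the stripping of the SentencePiece word-boundary marker `▁` is an assumed detail of how "non-ASCII" would have to be decided. The timings are the ones reported in the table; by those numbers the `bad_words_ids` run takes about 37% longer than the default model, and the trimmed model about 5% less.

```python
import time
from statistics import median
from typing import Callable, Dict, List

SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker; itself non-ASCII


def non_ascii_bad_words_ids(vocab: Dict[str, int]) -> List[List[int]]:
    """Build a bad_words_ids-style list: one single-token entry per non-ASCII token.

    Hypothetical helper. The word-boundary marker is stripped first (assumption),
    otherwise every word-initial token would be banned as well.
    """
    banned = []
    for token, token_id in sorted(vocab.items(), key=lambda kv: kv[1]):
        if not token.replace(SPIECE_UNDERLINE, "").isascii():
            banned.append([token_id])
    return banned


def time_call_ms(fn: Callable[[], object], repeats: int = 5) -> float:
    """Median wall-clock time of fn() in milliseconds (hypothetical helper)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


# Toy vocabulary standing in for a real tokenizer's vocabulary (illustrative only).
toy_vocab = {"hello": 0, "1": 1, "안녕": 2, SPIECE_UNDERLINE + "the": 3, "café": 4}
bad_words_ids = non_ascii_bad_words_ids(toy_vocab)

# Timings as reported in the README table, relative to the default model.
reported_ms = {
    "donut-base": 205,
    "donut-base + 12k bad_words_ids": 280,
    "donut-base-ascii": 195,
}
baseline_ms = reported_ms["donut-base"]
relative = {name: ms / baseline_ms for name, ms in reported_ms.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x the default model's time")
```

With the Transformers library, a list shaped like `bad_words_ids` would be passed as `model.generate(..., bad_words_ids=bad_words_ids)`, and `time_call_ms` would wrap that call. Removing the tokens from the model avoids that per-step filtering altogether, which is consistent with the timings in the table.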