## General guidelines for how the lemmatizer works:

Given an input text in Hebrew, the model attempts to match each word with the correct lexeme from within the BERT vocabulary.

- If a word is split into multiple wordpieces, this doesn't cause a problem; the lexeme is still predicted with high accuracy.
- If the lexeme of a given token doesn't appear in the vocabulary, the model will attempt to predict the special token `[BLANK]`. In that case, the word is usually the name of a person or a city, and the lexeme is probably the word itself after removing prefixes, which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool (see the sketch after this list).
- For verbs, the lexeme is the 3rd person singular past form.
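
For example, the `[BLANK]` fallback can be handled as a small post-processing step. In this sketch, `segment` is a hypothetical stand-in for a prefix-removal call such as one built on dictabert-seg; the name and signature are illustrative assumptions, not part of any released API:

```python
from typing import Callable

def resolve_lexeme(word: str, predicted: str, segment: Callable[[str], str]) -> str:
    """Hypothetical post-processing of a single prediction."""
    if predicted == '[BLANK]':
        # Likely the name of a person or a city: fall back to the
        # surface form with its prefixes stripped.
        return segment(word)
    return predicted
```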

This method is purely neural, so in rare instances the predicted lexeme may not be lexically related to the input word at all, but rather a synonym from the same semantic space. To handle those edge cases, one can implement a filter on top of the prediction: look at the top-K matches and use a set of measures, such as edit distance, to choose the candidate that can more reasonably serve as a lexeme for the input word, as sketched below.
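
A minimal sketch of such a filter, assuming `candidates` is a list of `(lexeme, score)` pairs taken from the model's top-K predictions (the helper names and the plain Levenshtein distance are illustrative, not part of the model's API):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pick_lexeme(word: str, candidates: list[tuple[str, float]]) -> str:
    """Prefer candidates close to the surface form, breaking ties by model score."""
    return min(candidates, key=lambda c: (edit_distance(word, c[0]), -c[1]))[0]
```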

Sample usage:
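
The snippet below is a sketch of what loading and calling the model typically looks like with 🤗 Transformers. The checkpoint name `dicta-il/dictabert-lex` and the `predict` helper exposed through `trust_remote_code` are assumptions here; check the model card's original snippet for the exact call:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-lex')
model = AutoModel.from_pretrained('dicta-il/dictabert-lex', trust_remote_code=True)
model.eval()

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת'
# Assumed to return a (word, lexeme) pair for every token in the sentence.
print(model.predict([sentence], tokenizer))
```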