Stemming refers to adding a pipe character after the stem/root of a word.
house| will also recognise houses
child| will also recognise children
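To make the pipe idea concrete, here is a minimal sketch of how such a term could be matched. It is not CafeTran's actual code: the function name `matches_stem` and the rule "everything before the pipe must match, anything may follow" are assumptions for illustration only.

```python
def matches_stem(term: str, word: str) -> bool:
    """Hypothetical matcher, not CafeTran's implementation.

    The text before the pipe is treated as the stem; a word is recognised
    if it starts with that stem. Without a pipe, only an exact
    (case-insensitive) match counts.
    """
    stem, pipe, _ending = term.partition("|")
    if not pipe:
        return word.lower() == term.lower()
    return word.lower().startswith(stem.lower())


houses_hit = matches_stem("house|", "houses")      # stem "house" + ending "s"
children_hit = matches_stem("child|", "children")  # stem "child" + ending "ren"
no_pipe_hit = matches_stem("house", "houses")      # no pipe, so exact match only
```

With the pipe, both `houses` and `children` are recognised; without it, `house` does not match `houses`.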
Prefix matching has little (nothing) to do with prefixes. It refers to the first so many letters of a word. It may have the same result as stemming, but the latter is more exact.
Yes, thank you. That is all clear. But I still don't fully know how these values work. Well, I guess I have to do some tests then, don't I. Stay tuned...
And yes, the pipe insertion will be more exact, but the Prefix is at the file level and easier to use (you don't have to bother with inserting pipes while adding new terms).
From the wiki:
Using prefix matching
When this option is selected, CafeTran will analyze the beginnings of words (here called prefixes) and discard any endings responsible for inflection of words.
It is an option which increases significantly the number of hits for highly inflected languages. The length of prefixes is set by a percentage number. The bigger the percent number the longer the prefix of words which the program will analyze.
The minimal prefix length option (menu Edit > Options > Memory > Minimal prefix length) lets you set the minimal allowed length of prefixes. The length can also be fixed, when the "fixed" option selected, instead of a set percentage length. It means that all the words will have the minimal prefix length, no matter their actual length.
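Reading the wiki text literally, the prefix length could be computed roughly as in the sketch below. This is only my interpretation: the function names, the rounding up, the default of 50 percent and the way the minimum is applied are all assumptions, not CafeTran's documented behaviour. Only the default minimal length of 4 comes from the program's settings.

```python
def prefix_length(word: str, percent: int = 50, minimal: int = 4,
                  fixed: bool = False) -> int:
    """Hypothetical prefix length for a term (assumed rules, see above)."""
    if fixed:
        # "Fixed": every word uses the minimal prefix length,
        # no matter how long the word actually is.
        return min(minimal, len(word))
    # Percentage of the word length, rounded up (rounding is an assumption).
    by_percent = (len(word) * percent + 99) // 100
    # Never shorter than the minimal prefix length, never longer than the word.
    return min(len(word), max(minimal, by_percent))


def prefix_match(term: str, word: str, **options) -> bool:
    """A word is recognised if it starts with the term's prefix."""
    n = prefix_length(term, **options)
    return word.lower().startswith(term.lower()[:n])


# "translation" has 11 letters:
len_80 = prefix_length("translation", percent=80)     # 9 -> "translati"
len_10 = prefix_length("translation", percent=10)     # 4 -> minimal length applies
len_fixed = prefix_length("translation", fixed=True)  # 4 -> always the minimal length
```

Under these assumptions a low percentage (or Fixed) gives a short prefix and therefore more hits, while a high percentage gives a long prefix and fewer hits: `prefix_match("translation", "translator", percent=10)` succeeds, but the same call with `percent=80` does not.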
Perhaps it is better to change the name Prefix matching to something new. After all:
What the CafeTran feature does (I guess) is truncating the word stem (truncate: to shorten something by cutting off the top or the end), possibly up to the prefix.
So 'Truncating' would probably be a better name.
I'd also like to know exactly what the fixed setting will do. I do not understand the part:
It means that all the words will have the minimal prefix length, no matter their actual length.
BTW: If Igor agrees with renaming Memory for terms to Termbases, a Termbases tab will likely be needed to hold settings like the Truncating variants.
Or rename the Glossary tab to a Terms tab and add the Termbase related Options there too.
And he could even consider adding the Auto-assembling options there.
Guess what I found:
The AA tab is difficult in this regard: some items are related to terms, some others aren't. Perhaps move the related ones to a new Terms tab and the other ones to Workflow?
I think this one belongs here too:
The new Terms tab could contain:
Created 10 Termbases to start testing the Truncating and discovered that the Truncation value isn't saved in the Termbase.
So it's not possible to have different Truncation percentages active at the same time. If I change the value for one Termbase, the others are affected too.
I guess I'd better go for a run now ;).
I have now tested this. I think that the Truncation setting should be stored in the Termbase too. Termbases can be used in two directions, and the Truncation settings can be very different for the two languages. There are other reasons as well, such as having multiple Termbases loaded that each require different Truncation settings.
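To illustrate what I mean by storing the setting in the Termbase, here is a rough sketch of a possible data layout. Everything in it is hypothetical (the class names, the fields, the Termbase name, the example values); it only shows the idea of one Truncation setting per Termbase and per direction, not how CafeTran stores anything.

```python
from dataclasses import dataclass


@dataclass
class TruncationSetting:
    """Hypothetical per-language Truncation (Prefix matching) setting."""
    percent: int = 50    # arbitrary example value
    minimal: int = 4     # the default minimal prefix length
    fixed: bool = False


@dataclass
class Termbase:
    """Hypothetical Termbase that carries its own settings for both directions."""
    name: str
    forward: TruncationSetting   # used when translating source -> target
    reverse: TruncationSetting   # used when the Termbase is used the other way round

    def setting_for(self, reversed_direction: bool) -> TruncationSetting:
        return self.reverse if reversed_direction else self.forward


# Two Termbases loaded at once, each with its own values.
tb_a = Termbase("Termbase A", TruncationSetting(percent=80), TruncationSetting(fixed=True))
tb_b = Termbase("Termbase B", TruncationSetting(percent=10), TruncationSetting(percent=60))
```

Changing the value for `tb_a` would then leave `tb_b` untouched, which is exactly what is not possible at the moment.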
Okay, here are the pictures. I've made them with Truncation (Prefix matching) set to Fixed length, which is by default 4 (characters):
All 4 terms are recognised.
With Truncation set to 10 (percent) all the terms in all segments are recognised.
With Truncation set to 80 (percent) only the terms in the first two segments are recognised.
One more reason to store the Prefix matching value in the Termbase:
**PLEASE NOTE:** You will have to experiment with the value for Prefix matching. This value will depend on some characteristics of your source language and will likely vary per language.
Can somebody answer this?
When you choose Fixed length, are only the first four letters used for term matching (term recognition), OR are the last four letters truncated?