
How does stemming work exactly?

Stemming can be triggered manually by inserting a | in a word in a glossary (in a Termbase too?).

Or it can be triggered automatically at the level of a whole Termbase, via the Prefix matching setting, which has several values.

Since I get nice results with fuzzy matching in Termbases when I check Prefix matching and select Fixed length from the drop-down list, I'd like to know what the other items in this list refer to. Percentages of the length of the source term?

If so, is Fixed length a percentage too?

On a side note: can I interpret stemming as a special form of fuzzy matching, where the matching takes place at the end of words only?

Are both names needed?

Stemming refers to adding a pipe character after the stem/root of a word.

house| will also recognise houses

child| will also recognise children
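A minimal sketch of how a pipe-marked entry might behave, assuming that the part before the | is treated as the stem and that any word starting with that stem counts as a hit. This is not CafeTran's actual code; the function name `matches_stem` and the case-insensitive comparison are assumptions made for illustration.

```python
def matches_stem(entry: str, word: str) -> bool:
    """Return True if `word` is a hit for the glossary `entry`.

    Assumption: the text before the first '|' is the stem, and a word
    matches when it starts with that stem (case-insensitive).
    An entry without a pipe has to match the whole word.
    """
    stem, pipe, _ending = entry.partition("|")
    if pipe:
        return word.lower().startswith(stem.lower())
    return word.lower() == entry.lower()


print(matches_stem("house|", "houses"))    # True
print(matches_stem("child|", "children"))  # True
print(matches_stem("house", "houses"))     # False: no pipe, exact match only
```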

Prefix matching has little (nothing) to do with prefixes. It refers to the first so many letters of a word. It may have the same result as stemming, but the latter is more exact.


Yes, thank you. That is all clear. But I still don't fully know how these values work. Well, I guess I have to do some tests then, don't I. Stay tuned...

And yes, inserting the pipe will be more exact, but Prefix matching is set at the file level and is easier to use (you don't have to bother with inserting pipes while adding new terms).

From the wiki:

Using prefix matching

When this option is selected, CafeTran will analyze the beginnings of words (here called prefixes) and discard any endings responsible for inflection of words.

It is an option which increases significantly the number of hits for highly inflected languages. The length of prefixes is set by a percentage number. The bigger the percent number the longer the prefix of words which the program will analyze.

The minimal prefix length option (menu Edit > Options > Memory > Minimal prefix length) lets you set the minimal allowed length of prefixes. The length can also be fixed, when the "fixed" option is selected, instead of a set percentage length. It means that all the words will have the minimal prefix length, no matter their actual length.
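One way to read the wiki text above, as a rough sketch rather than CafeTran's real implementation. The assumptions: the percentage is taken of the length of the term, the result is rounded up, the minimal prefix length acts as a floor, and "fixed" means the minimal prefix length is used for every word. The names `prefix_length` and `prefix_match` are hypothetical, and the default of 4 comes from the Fixed length default mentioned in this thread.

```python
from typing import Optional


def prefix_length(word: str, percent: Optional[int] = None, minimal: int = 4) -> int:
    """Number of leading characters compared under the assumed rules.

    percent=None stands for the "fixed" option: always use `minimal`.
    Otherwise take `percent` of the word length (rounded up), but never
    less than `minimal` and never more than the word itself.
    """
    if percent is None:
        n = minimal
    else:
        n = max(minimal, (len(word) * percent + 99) // 100)  # integer ceiling
    return min(n, len(word))


def prefix_match(term: str, word: str, percent: Optional[int] = None, minimal: int = 4) -> bool:
    """True if `word` starts with the computed prefix of `term`."""
    n = prefix_length(term, percent, minimal)
    return word.lower().startswith(term[:n].lower())


print(prefix_length("translation", percent=80))  # 9
print(prefix_length("translation", percent=10))  # 4 (the floor applies)
print(prefix_length("translation"))              # 4 (fixed)
print(prefix_match("translation", "translator", percent=80))  # False
print(prefix_match("translation", "translator", percent=10))  # True
```

Under these assumptions a higher percentage means a longer prefix and therefore stricter matching, and the fixed option behaves like a very low percentage.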

Perhaps it is better to change the name Prefix matching to something new. After all:

A prefix is an affix which is placed before the stem of a word.

What the CafeTran feature does (I guess) is truncate the word stem, possibly up to the prefix. (Truncate: to shorten something by cutting off the top or the end.)

So 'Truncating' would probably be a better name.

I'd also like to know exactly what the fixed setting will do. I do not understand the part:

It means that all the words will have the minimal prefix length, no matter their actual length.


BTW: If Igor agrees with renaming Memory for terms to Termbases, a Termbases tab will likely be needed, containing settings like the Truncating variants.

Or rename the Glossary tab to a Terms tab and add the Termbase related Options there too.

And he could even consider adding the Auto-assembling options there.

Guess what I found:

The AA tab is difficult in this regard: some items are related to terms, some others aren't. Perhaps move the related ones to a new Terms tab and the other ones to Workflow?

I think this one belongs here too:

The new Terms tab could contain:


Created 10 Termbases to start testing the Truncating and discovered that the Truncation value isn't saved in the Termbase.


So it's not possible to have different Truncation percentages active at the same time. If I change the value for one Termbase, the others are affected too.

I guess I'd better go for a run now ;).

I have now tested this. I think that the Truncation setting should be stored in the Termbase too. Termbases can be used in two directions, and the Truncation settings can be very different for the two languages. There are other reasons too, such as having multiple Termbases loaded that require different Truncation settings.

Okay, here are the pictures. I've made them with Truncation (Prefix matching) set to Fixed length, which is by default 4 (characters):

All 4 terms are recognised.

With Truncation set to 10 (percent) all the terms in all segments are recognised.

With Truncation set to 80 (percent) only the terms in the first two segments are recognised.

I now see a problem arising: when I select 10 percent, CafeTran matches on only the first 10% of the word, thus taking away (truncating) 90%.
So if you renamed Prefix matching to Truncation, you would also have to swap the percentages: 10% would become 90%, etc.
Or: Truncation up to:

One more reason to store the Prefix matching value in the Termbase:

**PLEASE NOTE:** You will have to experiment with the value for Prefix matching. This value will depend on some characteristics of your source language and will likely vary per language.

Can somebody answer this?

When you choose Fixed length, only the first four letters will be used for term matching (term recognition), OR the last four letters will be truncated.
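To make the two readings concrete, here is a small sketch that puts them side by side. Both functions are hypothetical and only illustrate the question; which one (if either) CafeTran uses is not confirmed in this thread. The value 4 is the Fixed length default mentioned earlier.

```python
FIXED = 4  # default Fixed length mentioned earlier in this thread


def keep_first(word: str, n: int = FIXED) -> str:
    """Reading 1: only the first n letters are used for matching."""
    return word[:n]


def drop_last(word: str, n: int = FIXED) -> str:
    """Reading 2: the last n letters are cut off, the rest is used."""
    return word[:-n] if len(word) > n else ""


for word in ("translation", "houses"):
    print(word, "->", keep_first(word), "|", drop_last(word))
# translation -> tran | transla
# houses -> hous | ho
```

The two readings only give the same result for words of exactly eight letters, so a test with short and long terms should show which one applies. The wiki wording ("all the words will have the minimal prefix length") seems to point to the first reading, but that is an interpretation.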
