When running "Recall memory" in Total Recall, how can I make it so CT ends up with fewer TUs in the resulting TMX? That is, how can I lower the percentage of fuzzy matches that get sent to the TMX?
When running Recall memory on relatively large TR databases (of around 1,000,000–2,000,000 segments), I end up with a TMX that is often around 250 MB. I obviously set these TMXs to Preliminary memory matching only, but the Preliminary memory matching for a TMX of this size takes forever.
By way of comparison, when running pre-translation in LogiTerm Pro on the same document, against what is basically exactly the same database, the TMX file that LogiTerm produces is much, much smaller, and thus much more manageable inside CT.
His Countness: ....how can I make it so CT ends up with less TU in the resulting TMX?
You can try - try - to lower the number of hits the TR TM will show. The default value was 100, so when I lowered it to 50, that Kmitowski bloke decided to put it up, to 1,000 if I'm not mistaken. Can't find where to do this anymore, and there is of course a risk of missing (subsegment) matches if you set the value too low.
Total Recall operates on a different principle than fuzziness. It retrieves segments based on the contextual (taking current project's context into account) search.
1. Make sure the Recall in context (hits per word) option is checked.
2. Try lowering the default 1000 hits to 500 or 250. In most cases, even 100 hits produce a good working TM.
Fuzzy matching takes place only after the recall is completed.
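To illustrate why the size of the recalled TMX matters for the second stage, here is a minimal sketch of fuzzy matching run over an already-recalled working memory. It is not CT's actual scoring: the similarity ratio from Python's difflib is a stand-in, and the function name `fuzzy_matches` and the `min_similarity` threshold are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_matches(source, working_memory, min_similarity=0.6):
    """Score every recalled TU against the source segment.

    working_memory: list of (source_text, target_text) pairs produced
    by the recall step. The scoring below is an assumed stand-in, not
    CT's real fuzzy algorithm.
    """
    scored = []
    for src, tgt in working_memory:
        # Every TU in the working memory is compared, so the cost of
        # this stage grows with the size of the recalled TMX.
        score = SequenceMatcher(None, source.lower(), src.lower()).ratio()
        if score >= min_similarity:
            scored.append((score, src, tgt))
    # Best matches first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


recalled = [
    ("This is an example", "X"),
    ("Completely different text here", "Y"),
]
result = fuzzy_matches("This is an example", recalled)
```

The point of the sketch is only that the fuzzy stage has to look at whatever the recall stage hands it, so fewer hits per word means less work here.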
Indeed, I sort of knew the answer already: experiment with (lower) the "Recall in context" number. I'll play around with that number, and do some comparisons (also with LogiTerm pre-translations run on the same db), etc. and report back here.
As Hans mentioned, I am always afraid of missing something if I lower it too much. Somehow, LogiTerm seems to always produce very small TMXs, and yet manages to always catch anything relevant. I'd love to know what their underlying system is, and how it compares to how TR works. As far as I know, LogiTerm uses SQLite databases for its data. I'm currently using MySQL, btw, as I find it imports much faster than SQLite. LogiTerm also indexes TMXs very fast, quite a bit faster than importing a TMX into a TR table in CT.
I was expecting to be able to tweak something @ Edit > Preferences > Memory, but now understand that this is not necessary/possible, because Total Recall and Fuzzy matching operate on two different principles.
>> I am always afraid of missing something if I lower it too much.
It's possible, I think, because TR works as follows (my understanding).
For instance, when the source is "This is an example" and the hits number is set at 1000, CT picks up the first 1000 TUs that contain "This" from the database, and saves them in a working memory. Next, it does the same thing for "is", "an", and "example".
This means it is possible that CT stops searching the database before it reaches the end of the database (because it stops when the specified number of hits are found) and that TUs located toward the end of the database (newer ones?) may be missed even when they are good matches, if the hits number is too low.
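The behaviour described above can be sketched roughly as follows. This is only a toy model of my understanding, not how CT is actually implemented: the function name `recall`, the in-memory list standing in for the TR database, and the whole-word matching are all assumptions.

```python
def recall(source, database, hits_per_word):
    """Collect up to hits_per_word TUs per source word, in stored order.

    database: list of (source_text, target_text) pairs; a hypothetical
    stand-in for a Total Recall table.
    """
    working_memory = []
    seen = set()
    for word in source.lower().split():
        hits = 0
        for idx, (src, tgt) in enumerate(database):
            if hits >= hits_per_word:
                # Limit reached: stop scanning, so later TUs are
                # never looked at for this word.
                break
            if word in src.lower().split():
                hits += 1
                if idx not in seen:
                    seen.add(idx)
                    working_memory.append((src, tgt))
    return working_memory


db = [
    ("This is an apple", "A"),
    ("This is an orange", "B"),
    ("This is an umbrella", "C"),  # stored last
]
small = recall("This is an example", db, hits_per_word=2)
large = recall("This is an example", db, hits_per_word=1000)
```

With a limit of 2, the last TU never makes it into the working memory, because it shares only very common words with the source and the limit for each of those words is used up earlier in the table. With a limit of 1000, all three TUs are picked up. A TU that contains a rarer word from the source would still be found through that word.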
> By way of comparison,
Try the following:
I'm trying what you suggested.
However, the problem is, and remains, that performing the Preliminary matching step takes very long. I just ran another Recall (with Recall in context set to "250" this time, to try to reduce the size of the initial extraction), but it looks like Preliminary memory matching is going to take forever again.
The last one I did (with Recall in context set to "1000") took like 4 hours or something (!!!). And all the while, you of course can't shut CT or restart or whatever.
And this when LogiTerm performs its pre-translation scan in around 15 mins max. What on earth kind of system are they using, that means it can pre-translate against such large dbs so fast? There is only really one setting (%) you can change in LogiTerm's LogiTrans tool, which is called "Minimum similarity" (I always set mine to 60%). It looks like this:
Preliminary memory matching takes longer with the Fuzzy and Hits option selected. Here, apart from the usual fuzzy matching, CT also guesses the meaning of subsegments, which takes time. Select Fuzzy only option to reduce the processing significantly. No matter what option you select, you don't need to wait until Preliminary matching is completed. Just make sure it goes ahead a few segments.
> Select Fuzzy only option to reduce the processing significantly.
Do the above if you wish to speed up time to complete the matching. Otherwise, I wouldn't recommend it because:
1. You can still work with the Preliminary matching ahead in the background.
2. Hits and Fragments are quite useful extraction stuff.
Hmm, that was a very good point you made there: "No matter what option you select, you don't need to wait until Preliminary matching is completed. Just make sure it goes ahead a few segments."
For some reason, I get uncomfortable when a progress bar is .. progressing, but of course you're right: as long as it's one or two segments ahead of me (which should be easy, as I am a slow translator), who cares, right?
Also, I have 32 GB of RAM and tons of (SSD) storage space, so it also doesn't matter how large the Recalled TMs are, or how much crap is being held in my RAM (within reason, of course).