OK, I have a question, which came about in connection with a discussion that is currently going on over at Proz about the new Lift technology, and Total Recall.
It basically boils down to this: how does subsegment matching in CafeTran relate to, or work in (if it does), Total Recall?
Hans (van den Broek) and I have been trying to understand it in the forum discussion above. Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations. I am not sure, or at any rate, don't have a clue.
If you look at my post titled "here’s an example of what I mean (CafeTran LIFTing)" (which is here: http://www.proz.com/forum/sdl_trados_support/289937-lift_technology_is_it_on_its_way-page3.html), you'll see that there seems to be some subsegment matching (and hence fuzziness) in my example screenshots. However, how is this possible?
Sorry if my question is not very well posed. I'm a bit short on time, as usual.
So I tried to add a TMX memory to the database, thinking it would add a table to it, but it doesn't. You can add the TM to an existing table, or you can add a table, of course. Great! Excellent!
Thanks, Igor. The first KB article!
And I'm beginning to see the light. And why I shouldn't have lowered the number of matches for Recall.
Everything is still as I claimed it was: Total Recall doesn't recall fuzzy matches and subsegments, only complete segments and words. However, if you have enough "word hits" (segments containing the word), the resulting TMX file will most likely include a segment that contains the word as part of the subsegment you're after, and subsegment matching does work on that TMX file. Bad luck if you lowered the number of matches (like I did, from 100 to 50), and if the subsegment you're looking for is at the end of the table (as I selected in my deleted test). Correct?
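To make the "end of the table" effect concrete, here is a minimal sketch in Python with a throwaway in-memory SQLite database. The table name `tm`, its columns, and the `LIKE` query are my own assumptions for illustration only; I don't know what the real Total Recall schema or query looks like. It only shows how a cap on the number of word hits can cut off a segment that sits late in the table.

```python
import sqlite3

def recall_word_hits(conn, word, limit):
    """Return up to `limit` source segments containing `word`, in table order.

    Hypothetical query: a plain substring match, no fuzziness involved.
    """
    rows = conn.execute(
        "SELECT source FROM tm WHERE source LIKE ? ORDER BY rowid LIMIT ?",
        (f"%{word}%", limit),
    )
    return [row[0] for row in rows]

# Hypothetical table: 60 filler segments, then the interesting one at the end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tm (source TEXT, target TEXT)")
segments = [f"The farm keeps animals in barn {i}." for i in range(60)]
segments.append("The holding raises animals and processes milk.")
conn.executemany("INSERT INTO tm VALUES (?, '')", [(s,) for s in segments])

phrase = "raises animals and processes"
hits_50 = recall_word_hits(conn, "animals", 50)
hits_100 = recall_word_hits(conn, "animals", 100)

# Does any recalled segment contain the phrase we want as a subsegment?
found_50 = any(phrase in s for s in hits_50)
found_100 = any(phrase in s for s in hits_100)
print(len(hits_50), found_50, len(hits_100), found_100)
```

With the cap at 50, the segment at the end of the table never makes it into the recalled set, so nothing downstream can match against it. With the cap at 100, all 61 rows come back and the phrase is there.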
The recalled segments take part in the subsegment matching, which yields hits. Based on the frequency of those hits, CafeTran creates virtual matches. Two important options determine the accuracy of the hits:
1. Edit > Options > Subsegment to auto threshold (when the subsegment is used for auto-assembling).
2. Edit > Options > Subsegment to virtual threshold (when the subsegment is used for auto-assembling and placed in a separate virtual map which holds the virtual subsegment matches).
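As a rough illustration of frequency-based subsegment hits, here is a small Python sketch. It is not CafeTran's algorithm: the n-gram counting, the function names, and the way the threshold is applied are all assumptions of mine. It only shows the general idea that a subsegment seen in enough recalled segments can be promoted, while a rarer one is ignored.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous runs of n words, joined back into strings."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def subsegment_hits(segment, recalled, min_len=2):
    """Count, for each subsegment of `segment`, the recalled segments containing it.

    Simplification: a plain substring test, not a word-boundary match.
    """
    words = segment.lower().split()
    counts = Counter()
    for n in range(min_len, len(words) + 1):
        for gram in ngrams(words, n):
            counts[gram] = sum(gram in r.lower() for r in recalled)
    return counts

def above_threshold(counts, threshold):
    """Subsegments whose hit count reaches the (hypothetical) threshold."""
    return sorted(g for g, c in counts.items() if c >= threshold)

recalled = [
    "He raises animals on a farm.",
    "She raises animals and sells wool.",
    "The plant processes milk.",
]
counts = subsegment_hits("raises animals and processes", recalled)
strict = above_threshold(counts, 2)
loose = above_threshold(counts, 1)
print(strict)
print(loose)
```

Raising the threshold leaves fewer but better-supported subsegments, which is how I read the purpose of the two options above. How CafeTran actually interprets its threshold values is something Igor would have to confirm.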
I started a series of articles on Total Recall. Please see the first article here: https://cafetran.freshdesk.com/solution/articles/6000052913-total-recall-as-a-storage-and-retrieval-memory-system.
By the way, I also tested subsegment matching using TR a couple of weeks ago. I never saved that test, because it showed the expected result: It doesn't work.
I opened the DGT, and selected a phrase near the end of it, consisting of words that are common enough to yield the required number of hits (50 per my settings then). I created a document with that phrase, consisting of three entries, starting and ending with the phrase, and one with it in the middle. No subsegment results, as expected. However, I was a little confused to see a (sub)segment result of part of the phrase, that later turned out to be a segment match. No contradiction there. I'm not going to repeat that test, but feel free to try it yourself.
A couple of weeks ago, I tested the lot. I still have the test, but I don't know exactly what I did anymore (it wasn’t for publication…), so I repeated it. That's "test2" in the ZIP. The earlier files are also in the ZIP.
I wrote a document with only four words in it - raises animals and processes - copied from a real-life EU document, opened it in CT, opened the EN-NL DGT as a table, and ran Total Recall. There's obviously no segment match, so CT starts mining the three remaining words (I take it CT treats "and" as a stopword, Igor can explain), and this results in the expected 300 segments (minus a few, probably because of duplicates, Igor can explain).
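The arithmetic behind "300 minus a few" can be sketched like this in Python. The stopword list, the toy corpus, and the `lookup` function are invented for the example. Whether CT really drops "and" as a stopword and really removes duplicates this way is my guess, not something I know.

```python
STOPWORDS = {"and", "or", "the", "of"}  # hypothetical list, for illustration only

def content_words(text):
    """Split on whitespace and drop the assumed stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def mine(words, lookup, limit):
    """Collect up to `limit` hits per word, keeping each segment only once."""
    seen = set()
    recalled = []
    for word in words:
        for seg in lookup(word)[:limit]:
            if seg not in seen:
                seen.add(seg)
                recalled.append(seg)
    return recalled

corpus = [
    "He raises animals.",
    "She raises cattle.",
    "The plant processes milk.",
    "It raises animals and processes meat.",
]

def lookup(word):
    # Toy stand-in for the database query: substring match in table order.
    return [s for s in corpus if word in s.lower()]

words = content_words("raises animals and processes")
limit = 2
recalled = mine(words, lookup, limit)
upper_bound = len(words) * limit
print(words, len(recalled), upper_bound)
```

Three words at 2 hits each gives an upper bound of 6, but only 4 distinct segments come back because some segments are hit by more than one word. Scale the limit to 100 and you get the same picture as my test: up to 300, minus the overlaps.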
AA enabled (as always):
I don't see any "virtual matches," but it may have something to do with it. Igor can explain.
MB: Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations.
That's not what I said. I said: TR uses data in a very simple SQLite database to extract segments and words from the document to be translated. Those data are non-hierarchical/unstructured, so there are neither fuzzy matches nor subsegment matches, only "complete segments" and words.
Studio's SDLTM is also an SQLite database, but it does allow for fuzziness and subsegment matching. So does DejaVu's MDBX Access database. "We" have TMX (segments, termbases) files for those purposes, whereas DV and Studio only have those databases. The idea behind CT's external databases was (correct me if I'm wrong, Igor) to provide a means to search large resources fast, and to collaborate with others (via a server). Igor added the Recall functionality later to quickly extract segments and words from those large resources.
But I see another perfect storm coming. "We also want fuzziness in plain text glossaries and databases, and subsegment matching. We want more. We want more. More. Mooore. Bigger. BIGGER. BIGGEST."