OK, I have a question, which came about in connection with a discussion that is currently going on over at Proz about the new Lift technology, and Total Recall.
It basically boils down to this: how does subsegment matching in CafeTran relate to, or work in (if it does), Total Recall?
Hans (van den Broek) and I have been trying to understand it in the forum discussion above. Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations. I am not sure, or at any rate, don't have a clue.
If you look at my post titled "here’s an example of what I mean (CafeTran LIFTing)" (which is here: http://www.proz.com/forum/sdl_trados_support/289937-lift_technology_is_it_on_its_way-page3.html), you'll see that there seems to be some subsegment matching (and hence fuzziness) in my example screenshots. However, how is this possible?
Sorry if my question is not very well posed. I'm a bit short on time, as usual.
MB: Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations.
That's not what I said. I said: TR uses data in a very simple SQLite database to extract segments and words from the document to be translated. Those data are non-hierarchical/unstructured, so there's neither fuzziness nor subsegment matching, only "complete segments" and words.
Studio's SDLTM is also an SQLite database, but it does allow for fuzziness and subsegment matching. So does DejaVu's MDBX Access database. "We" have TMX (segments, termbases) files for those purposes, whereas DV and Studio only have those databases. The idea behind CT's external databases was (correct me if I'm wrong, Igor) to provide a means to search large resources fast, and to collaborate with others (via a server). Igor added the Recall functionality later to quickly extract segments and words from those large resources.
But I see another perfect storm coming. "We also want fuzziness in plain text glossaries and databases, and subsegment matching. We want more. We want more. More. Mooore. Bigger. BIGGER. BIGGEST."
A couple of weeks ago, I tested the lot. I still have the test, but I don't know exactly what I did anymore (it wasn’t for publication…), so I repeated it. That's "test2" in the ZIP. The earlier files are also in the ZIP.
I wrote a document with only four words in it - raises animals and processes - copied from a real-life EU document, opened it in CT, opened the EN-NL DGT as a table, and ran Total Recall. There's obviously no segment match, so CT starts mining the three words (I take it CT treats "and" as a stopword, Igor can explain), and this results in the expected 300 segments (minus a few, probably because of duplicates, Igor can explain).
AA enabled (as always):
I don't see any "virtual matches," but it may have something to do with it. Igor can explain.
By the way, I also tested subsegment matching using TR a couple of weeks ago. I never saved that test, because it showed the expected result: It doesn't work.
I opened the DGT, and selected a phrase near the end of it, consisting of words that are common enough to yield the required number of hits (50 per my settings then). I created a document with that phrase, consisting of three entries, starting and ending with the phrase, and one with it in the middle. No subsegment results, as expected. However, I was a little confused to see a (sub)segment result of part of the phrase, that later turned out to be a segment match. No contradiction there. I'm not going to repeat that test, but feel free to try it yourself.
The recalled segments take part in the subsegment matching that gives hits producing virtual matches. Based on the hits frequency, CafeTran creates virtual matches. See two important options which determine the accuracy of the hits:
1. Edit > Options > Subsegment to auto threshold (when the subsegment is used for auto-assembling).
2. Edit > Options > Subsegment to virtual threshold (when the subsegment is used for auto-assembling and placed in a separate virtual map which holds the virtual subsegment matches).
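To make the idea of a hit-frequency threshold concrete, here is a minimal sketch in Python. It is not CafeTran's actual algorithm, which is not described in this thread. It assumes that a "subsegment" is a run of 2 to 4 consecutive words of the segment being translated, that a "hit" is a recalled segment containing that run, and that a subsegment is kept once its hit count reaches the threshold. The names ngrams, frequent_subsegments, min_len and max_len are made up for the illustration.

```python
from collections import Counter

def ngrams(words, n):
    """Return all runs of n consecutive words, joined with spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequent_subsegments(segment, recalled_sources, threshold, min_len=2, max_len=4):
    """Count in how many recalled segments each subsegment occurs.

    Only subsegments whose hit count reaches the threshold are returned.
    This is an assumed, simplified model of a frequency threshold.
    """
    words = segment.lower().split()
    candidates = {g for n in range(min_len, max_len + 1) for g in ngrams(words, n)}
    counts = Counter()
    for source in recalled_sources:
        source_words = source.lower().split()
        seen = set()  # count each subsegment at most once per recalled segment
        for n in range(min_len, max_len + 1):
            seen.update(g for g in ngrams(source_words, n) if g in candidates)
        counts.update(seen)
    return {g: c for g, c in counts.items() if c >= threshold}

recalled = [
    "He raises animals for meat",
    "She raises animals and sells milk",
    "The plant processes milk",
]
hits = frequent_subsegments("raises animals and processes", recalled, threshold=2)
print(hits)
```

In this toy model, a higher threshold means fewer but better-supported subsegments, and a lower one means more candidates with less evidence behind each.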
I started a series of articles on Total Recall. Please see the first article here: https://cafetran.freshdesk.com/solution/articles/6000052913-total-recall-as-a-storage-and-retrieval-memory-system.
Thanks, Igor. The first KB article!
And I'm beginning to see the light. And why I shouldn't have lowered the number of matches for Recall.
Everything is still as I claimed it was: Total Recall doesn't recall fuzzy matches and subsegments, only complete segments and words. However, if you have enough "word hits" (segments with the word), you will most likely get a segment with the word in the subsegment in the resulting TMX file where subsegment matching does work. Bad luck if you lowered the number of matches (like I did, from 100 to 50), and if the subsegment you're looking for is at the end of the table (as I selected in my deleted test). Correct?
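The behaviour described above can be sketched with Python's sqlite3 module. This is only an illustration under stated assumptions: the table layout (segments with id, source, target), the function names and the use of LIKE for word lookup are invented here and say nothing about how the real Total Recall database is organised. The sketch shows exact segment lookup, a capped number of word hits taken in table order, and how a low cap misses a segment that sits at the end of the table.

```python
import sqlite3

def build_table(rows):
    """Create an in-memory table of (source, target) pairs (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE segments (id INTEGER PRIMARY KEY, source TEXT, target TEXT)"
    )
    con.executemany("INSERT INTO segments (source, target) VALUES (?, ?)", rows)
    return con

def recall(con, segment, hits_per_word):
    """Return exact segment matches plus at most hits_per_word rows per word."""
    found = {}
    # 1. Complete-segment lookup: plain equality, no fuzziness.
    exact = "SELECT id, source, target FROM segments WHERE source = ?"
    for row in con.execute(exact, (segment,)):
        found[row[0]] = row[1:]
    # 2. Word lookup: the first N rows containing each word, in table order.
    #    LIKE also matches inside longer words; good enough for a sketch.
    by_word = (
        "SELECT id, source, target FROM segments "
        "WHERE source LIKE ? ORDER BY id LIMIT ?"
    )
    for word in segment.split():
        for row in con.execute(by_word, (f"%{word}%", hits_per_word)):
            found[row[0]] = row[1:]
    return [found[key] for key in sorted(found)]

# Five filler rows with a common word, then the interesting row at the very end.
rows = [(f"animals {i}", f"dieren {i}") for i in range(5)]
wanted = ("he raises animals and processes milk", "hij fokt dieren en verwerkt melk")
rows.append(wanted)
con = build_table(rows)

low = recall(con, "farm animals", hits_per_word=3)
high = recall(con, "farm animals", hits_per_word=10)
print(len(low), wanted in low)    # 3 False
print(len(high), wanted in high)  # 6 True
```

With the cap at 3, the row at the end of the table never makes it into the result, so no later subsegment matching can use it. With the cap at 10 it is recalled along with everything else.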
So I tried to add a TMX memory to the database, thinking it would add a table to it, but it doesn't. You can add the TM to an existing table, or you can add a table, of course. Great! Excellent!
Yes, you need to have at least one table in Total Recall to add segments to it.
Please see: https://cafetran.freshdesk.com/solution/categories/6000028195/folders/6000058183/articles/6000053020-creating-a-total-recall-table
I've been playing around with Total Recall again, and I have a question.
Before I ask it though, let me indicate how I usually use TR:
I have a huge TMLookup .db, where I dump all my TMXs, including every project TMX right after I finish a job in CT.
Then, each time I start a job, I run Total Recall > Recall to memory…
with these settings:
Now my question is: there are past jobs that I finished, and if I go back and open one of them, and then run the above command on it, the TMX that Total Recall creates will show hits for some of the project's segments, but not all of them. I don't understand this. After all, I send every single project TMX to my Total Recall database. How can I make it so that every single future segment is shown in these TMXs, if, that is, they are in my big TMLookup db?
What settings do I need to change in order to use Total Recall in this way?
Also note that I am less fussed about subsegment matches in these TR TMXs than I am about being able to locate any exact (or very high fuzzy) matches with past projects.
Things go wrong if (not sure, though):
As I remember, your Total Recall segments database is really huge (around 40 million units). In that case, I would suggest increasing the Recall in context value to somewhere in the 500 to 1000 range. This will increase the probability of retrieving most, if not all, of the relevant segments. I will write another article on the subject soon.
Thanks, I set it to 1000, and it seems to be finding almost all of them. I'll continue testing when I have time, but it seems to now find around 95% of them (which is good enough for my purposes: checking if I have ever translated something similar in the past). The ones not being found are mainly the shortest segments, it seems.
More details, please. I just ran a 2,300-word document against the 2.5-million-segment DGT with hits/word 100. It resulted in a 33,500-segment TM (don't ask me why). I have screenshots of all relevant data.
If you run a similar document against a 40-million-segment table with 1,000 hits/word, I expect the resulting TM cannot be used for automatic workflow, and manual workflow will be considerably slower than a manual search in the table. Even "pretranslate" may result in significant delays.
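A rough back-of-the-envelope check of these figures, in Python. It assumes that the Recall setting caps the hits per word and that every word in the document is looked up once. Real documents repeat words, stopwords may be skipped, and duplicate segments are dropped, so this is only an upper bound and not a prediction. The function names are made up for the illustration.

```python
def upper_bound(words, hits_per_word):
    """Largest possible number of recalled segments if every word
    were distinct and each returned its full quota of hits."""
    return words * hits_per_word

def fill_ratio(observed, words, hits_per_word):
    """Observed result size as a fraction of the upper bound."""
    return observed / upper_bound(words, hits_per_word)

bound_100 = upper_bound(2300, 100)     # the run reported above
bound_1000 = upper_bound(2300, 1000)   # same document, 1,000 hits/word
print(bound_100, bound_1000)
print(round(fill_ratio(33500, 2300, 100), 3))
```

The bound grows linearly with the hits/word setting, so going from 100 to 1,000 allows up to ten times as many recalled segments for the same document.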
Where do I go wrong? Do I go wrong?
My 2000-word project created a 100 MB TMX (around 150,000 TUs). This was with "Recall in context" in Total Recall set to 1000. I am currently trying to run:
Translation > Pre-translate all segments
on the tmdata_TM.tmx file