OK, I have a question, which came about in connection with a discussion that is currently going on over at Proz about the new Lift technology, and Total Recall.
It basically boils down to this: how does subsegment matching in CafeTran relate to, or work in (if it does), Total Recall?
Hans (van den Broek) and I have been trying to understand it in the forum discussion above. Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations. I am not sure, or at any rate, don't have a clue.
If you look at my post titled "here’s an example of what I mean (CafeTran LIFTing)" (which is here: http://www.proz.com/forum/sdl_trados_support/289937-lift_technology_is_it_on_its_way-page3.html), you'll see that there seems to be some subsegment matching (and hence fuzziness) in my example screenshots. How is this possible, though?
Sorry if my question is not very well posed. I'm a bit short on time, as usual.
MB: Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations.
That's not what I said. I said: TR uses data in a very simple SQLite database to extract segments and words from the document to be translated. Those data are non-hierarchical/unstructured, so there is neither fuzziness nor subsegment matching, only "complete segments" and words.
Studio's SDLTM is also an SQLite database, but it does allow for fuzziness and subsegment matching. So does DejaVu's MDBX Access database. "We" have TMX (segments, termbases) files for those purposes, whereas DV and Studio only have those databases. The idea behind CT's external databases was (correct me if I'm wrong, Igor) to provide a means to search large resources fast, and to collaborate with others (via a server). Igor added the Recall functionality later to quickly extract segments and words from those large resources.
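To illustrate the point that the SQLite file format by itself does not rule out fuzziness, here is a minimal sketch in Python. It is not how Studio, DejaVu or CafeTran actually work; the table name `segments`, the function `fuzzy_lookup` and the 0.7 threshold are all made up for the example. The database only stores and returns rows, and the fuzzy scoring is done by the application on top of it.

```python
import sqlite3
from difflib import SequenceMatcher


def build_demo_db():
    """Create a tiny in-memory SQLite table of source/target segments (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE segments (source TEXT, target TEXT)")
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?)",
        [
            ("Press the red button to stop the machine.",
             "Druk op de rode knop om de machine te stoppen."),
            ("Press the green button to start the machine.",
             "Druk op de groene knop om de machine te starten."),
            ("Store the device in a dry place.",
             "Bewaar het apparaat op een droge plaats."),
        ],
    )
    return conn


def fuzzy_lookup(conn, query, threshold=0.7):
    """Return (score, source, target) for rows whose source text resembles the query."""
    hits = []
    # SQLite just hands over the rows; the similarity score is computed in Python.
    for source, target in conn.execute("SELECT source, target FROM segments"):
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold:
            hits.append((round(score, 2), source, target))
    return sorted(hits, reverse=True)


conn = build_demo_db()
matches = fuzzy_lookup(conn, "Press the red button to start the machine.")
for score, source, target in matches:
    print(score, source, "->", target)
```

Scanning every row like this would be far too slow on millions of segments, which is presumably why a real tool first narrows down the candidates (for example via a word index) before scoring them.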
But I see another perfect storm coming. "We also want fuzziness in plain text glossaries and databases, and subsegment matching. We want more. We want more. More. Mooore. Bigger. BIGGER. BIGGEST."
Yes, you need to have at least one table in Total Recall to add segments to it.
Please see: https://cafetran.freshdesk.com/solution/categories/6000028195/folders/6000058183/articles/6000053020-creating-a-total-recall-table
I've been playing around with Total Recall again, and I have a question.
Before I ask it though, let me indicate how I usually use TR:
I have a huge TMLookup .db, where I dump all my TMXs, including every project TMX right after I finish a job in CT.
Then, each time I start a job, I run Total Recall > Recall to memory…
with these settings:
Now my question is: if I go back and open one of my finished jobs and run the above command on it, the TMX that Total Recall creates shows hits for some of the project's segments, but not all of them. I don't understand this. After all, I send every single project TMX to my Total Recall database. How can I make sure that every single segment of a future job shows up in these TMXs, if, that is, it is in my big TMLookup db?
What settings do I need to change in order to use Total Recall in this way?
Also note that I am less fussed about subsegment matches in these TR TMXs than I am about being able to locate any exact (or very high fuzzy) matches with past projects.
Things go wrong if (not sure, though):
As I remember, your Total Recall segments database is really huge (around 40 million units). In that case, I would suggest increasing the Recall in context value to somewhere in the 500 to 1000 range. This will increase the probability of retrieving most of, if not all, the relevant segments. I will write another article on the subject soon.
Thanks, I set it to 1000, and it seems to be finding almost all of them. I'll continue testing when I have time, but it seems to now find around 95% of them (which is good enough for my purposes: checking if I have ever translated something similar in the past). The ones not being found are mainly the shortest segments, it seems.
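One possible reading of why a higher value finds more segments, and why the short ones tend to be missed, sketched in Python. This is a toy model and not CafeTran's actual algorithm: it assumes the setting acts as a cap on the number of hits fetched per word, and that hits come back in insertion order. The names `build_index`, `recall` and `hits_per_word` are invented for the example.

```python
from collections import defaultdict


def build_index(segments):
    """Map each lowercase word to the ids of the segments containing it, in insertion order."""
    index = defaultdict(list)
    for seg_id, text in enumerate(segments):
        for word in set(text.lower().split()):
            index[word].append(seg_id)
    return index


def recall(index, document_words, hits_per_word):
    """Collect at most hits_per_word segment ids for every word of the document."""
    recalled = set()
    for word in document_words:
        recalled.update(index.get(word, [])[:hits_per_word])
    return recalled


# Toy database: 500 filler segments made of very common words,
# followed by two segments from an old project.
database = [f"open the file number {i}" for i in range(500)]
database.append("open the file")  # id 500: short, common words only
database.append("the hydraulic valve actuator needs recalibration")  # id 501: contains rare words
index = build_index(database)

# The old project is opened again, so the document contains both old segments.
document_words = set("open the file".split()) | set(
    "the hydraulic valve actuator needs recalibration".split()
)

low = recall(index, document_words, hits_per_word=100)
high = recall(index, document_words, hits_per_word=1000)

print("cap 100 finds short segment:", 500 in low, "| long segment:", 501 in low)
print("cap 1000 finds short segment:", 500 in high, "| long segment:", 501 in high)
```

In this model the long segment is found even with a low cap, because a rare word such as "hydraulic" has only a handful of hits. The short segment consists of common words only, so it is crowded out until the cap is large enough to reach it.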
More details, please. I just ran a 2,300-word document against the 2.5-million-segment DGT with hits/word set to 100. It resulted in a 33,500-segment TM (don't ask me why). I have screenshots of all relevant data.
If you run a similar document against a 40-million-segment table with 1,000 hits/word, I expect that the resulting TM cannot be used for automatic workflow, and that manual workflow will be considerably slower than a manual search in the table. Even "pretranslate" may result in significant delays.
Where do I go wrong? Do I go wrong?
My 2000-word project created a 100 MB TMX (around 150,000 TUs). This was with "Recall in context" in Total Recall set to 1000. I am currently trying to run:
Translation > Pre-translate all segments
on the tmdata_TM.tmx file
Keep getting this error:
OK, enough playing around. Got to switch all this nonsense off and make some actual money for a few hours.
Am I right when I say
Thank you, thank you, thank you.
I was wondering what Michael was doing - I always wonder what on earth he's doing - and then I realised he's using the TM_Lookup database, rather than the CT database.
A few remarks on databases may be useful.
7. You can recall more than 1 table...
... and assign different priorities to the resulting TMs. I don't know if you can do that with fields (rather than tables), like context General and context Automotive, for example.
8. To search for phrases in a table, use quotes.