
Total Recall, subsegment matching, fuzziness and Kevin Flanagan’s ‘Lift’.

Hi Igor,


OK, I have a question, which came about in connection with a discussion that is currently going on over at Proz about the new Lift technology, and Total Recall.


It basically boils down to this: how does subsegment matching in CafeTran relate to, or work in (if it does), Total Recall?


Hans (van den Broek) and I have been trying to understand it in the forum discussion above. Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations. I am not sure, or at any rate, don't have a clue.


If you look at my post titled "here’s an example of what I mean (CafeTran LIFTing)" (which is here: http://www.proz.com/forum/sdl_trados_support/289937-lift_technology_is_it_on_its_way-page3.html), you'll see that there seems to be some subsegment matching (and hence fuzziness) in my example screenshots. However, how is this possible?


Sorry if my question is not very well posed. I'm a bit short on time, as usual.


MB: Hans says that since the TR database is SQLite, it cannot apply fuzziness in its operations.


That's not what I said. I said: TR uses data in a very simple SQLite database to extract segments and words from the document to be translated. Those data are non-hierarchical/unstructured, so there's no fuzziness nor subsegment matches, only "complete segments" and words.
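
To make the distinction concrete, here is a minimal sketch of what "complete segments and words" lookups look like against a flat SQLite table. The table name `tm`, its columns, and the sample rows are hypothetical and are not CafeTran's actual schema or queries; the point is only that an exact comparison and a per-word search return either a whole stored segment or nothing, with no similarity score involved.

```python
import sqlite3

# Hypothetical flat table: NOT CafeTran's real schema, just an illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tm (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO tm VALUES (?, ?)",
    [
        ("Press the red button.", "Druk op de rode knop."),
        ("Press the green button.", "Druk op de groene knop."),
        ("Close the door.", "Sluit de deur."),
    ],
)


def exact_hits(segment):
    """Return rows whose source text equals the segment exactly."""
    return conn.execute(
        "SELECT source, target FROM tm WHERE source = ?", (segment,)
    ).fetchall()


def word_hits(word, limit):
    """Return at most `limit` rows whose source text contains the word."""
    return conn.execute(
        "SELECT source, target FROM tm WHERE source LIKE ? ORDER BY rowid LIMIT ?",
        (f"%{word}%", limit),
    ).fetchall()


exact = exact_hits("Press the red button.")       # stored verbatim: one row
near_miss = exact_hits("Press the blue button.")  # one word differs: no rows
by_word = word_hits("button", 10)                 # whole segments containing the word
```

Any fuzzy or subsegment matching would have to happen later, in whatever tool reads the extracted segments, not in lookups like these.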


Studio's SDLTM is also an SQLite database, but it does allow for fuzziness and subsegment matching. So does DejaVu's MDBX Access database. "We" have TMX (segments, termbases) files for those purposes, whereas DV and Studio only have those databases. The idea behind CT's external databases was (correct me if I'm wrong, Igor) to provide a means to search large resources fast, and to collaborate with others (via a server). Igor added the Recall functionality later to quickly extract segments and words from those large resources.

But I see another perfect storm coming. "We also want fuzziness in plain text glossaries and databases, and subsegment matching. We want more. We want more. More. Mooore. Bigger. BIGGER. BIGGEST."


H

Yes, you need to have at least one table in Total Recall to add segments to it.


Please see: https://cafetran.freshdesk.com/solution/categories/6000028195/folders/6000058183/articles/6000053020-creating-a-total-recall-table


Igor

Hi Igor,


I've been playing around with Total Recall again, and I have a question.


Before I ask it though, let me indicate how I usually use TR:


I have a huge TMLookup .db, where I dump all my TMXs, including every project TMX right after I finish a job in CT.


Then, each time I start a job, I run Total Recall > Recall to memory…


with these settings:




Now my question is: there are past jobs that I finished, and if I go back and open one of them, and then run the above command on it, the TMX that Total Recall creates will show hits for some of the project's segments, but not all of them. I don't understand this. After all, I send every single project TMX to my Total Recall database. How can I make it so every single future segment is shown in these TMXs, if, that is, they are in my big TMLookup db?


What settings do I need to change in order to use Total Recall in this way?


Also note that I am less fussed about subsegment matches in these TR TMXs than I am about being able to locate any exact (or very high fuzzy) matches with past projects.


Michael

Things go wrong if (not sure, though):

  1. Your database is too big, and doesn't have properties set. That's your case, I suppose. To fight that, you'll have to increase the number of hits per word, probably resulting in a huge TMX file
  2. The matches you are looking for consist of very common words. Again, you won't get all matches, unless etc.
  3. Exact matches will always show up, fuzzies never. That is, only if they can be extracted from the TMX file, see 1.
H.

Hi Michael,


As I remember, your Total Recall segments base is really huge (around 40 million units). I would therefore suggest increasing the Recall in context value to somewhere in the 500 to 1000 range. This will increase the probability of retrieving most of, if not all, the relevant segments. I will write another article on the subject soon.
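
As a rough illustration of why a higher value helps, here is a toy model of a per-word hit cap. It is an assumption made for the sake of the example and not CafeTran's actual retrieval algorithm: the function `recall` and the parameter `hits_per_word` are hypothetical names. Each word of the document pulls in at most `hits_per_word` database segments, so a segment made up only of very common words can be crowded out.

```python
def recall(document_segments, database, hits_per_word):
    """Toy model: for every word, take at most hits_per_word matching segments."""
    recalled = set()
    for segment in document_segments:
        for word in set(segment.lower().split()):
            hits = 0
            for candidate in database:
                if word in candidate.lower().split():
                    recalled.add(candidate)
                    hits += 1
                    if hits >= hits_per_word:
                        break  # cap reached for this word
    return recalled


# 50 similar segments stored first, the wanted one stored last.
database = [f"the valve number {i} is open" for i in range(50)]
database.append("the valve is open")

document = ["the valve is open"]

small = recall(document, database, hits_per_word=5)
large = recall(document, database, hits_per_word=100)

found_with_small_cap = "the valve is open" in small  # False in this toy setup
found_with_large_cap = "the valve is open" in large  # True in this toy setup
```

In this toy setup the low cap returns only 5 segments and misses the exact match, while the high cap finds it at the cost of returning all 51 segments, which mirrors the trade-off between recall and the size of the resulting TMX.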


Igor

Hi Igor,


Thanks, I set it to 1000, and it seems to be finding almost all of them. I'll continue testing when I have time, but it seems to now find around 95% of them (which is good enough for my purposes: checking if I have ever translated something similar in the past). The ones not being found are mainly the shortest segments, it seems.


Michael

More details, please. I just ran a 2,300-word document against the 2.5-million-segment DGT with hits/word set to 100. It resulted in a 33,500-segment TM (don't ask me why). I have screenshots of all relevant data.


If you run a similar document against a 40-million-segment table with 1,000 hits/word, I expect the resulting TM cannot be used for automatic workflow, and manual workflow will be considerably slower than a manual search in the table. Even "pretranslate" may result in significant delays.


Where do I go wrong? Do I go wrong?


H.

My 2000-word project created a 100 MB TMX (around 150,000 TUs). This was with "Recall in context" in Total Recall set to 1000. I am currently trying to run:


Translation > Pre-translate all segments


on the tmdata_TM.tmx file

Keep getting this error:



OK, enough playing around. Got to switch all this nonsense off and make some actual money for a few hours.


Am I right when I say

  • You can't Recall a whole DB (unless it consists of one table only)
  • You can only filter (on context, for example) after extracting the TM. In other words, you can't filter the table in CT before you run Recall

H.

Hi Hans,

  • You can recall the whole base by selecting "All tables" in Total Recall options.
  • You can filter the table before you run Recall. Please see: https://cafetran.freshdesk.com/solution/categories/6000028195/folders/6000058183/articles/6000054602-recalling-segments-with-properties-filter

Igor

Terima kasih, terima kasih, terima kasih.


H.

I was wondering what Michael was doing - I always wonder what on earth he's doing - and then I realised he's using the TM_Lookup database, rather than the CT database.

A few remarks on databases may be useful.

  1. You can't create a database in CT. Igor did that for us. You can find it in the package, in /Applications/CafeTran.app/Contents/Java/resources/databases/SQLiteMemoryBase.db (Mac, something similar for other OSes)
  2. As Martin found out, if it's not there, you can't create it, so you can't use it. That makes sense. The solution I didn't think of is very simple: just download CT again (the full version; the update won't contain a SQLiteMemoryBase.db, for the very simple reason that it would delete an existing DB), copy the DB to the path mentioned above, and delete the other downloaded files.
  3. The (field) options for the CT DB are limited to the ones mentioned in Edit | Options
  4. You can't create a DB in CT, but you can create tables, including for different language pairs. See the KB.
  5. You can only use table related SQL commands and queries
  6. You can connect to other databases like TM-Lookup, or one you created yourself. Just enter jdbc:sqlite:PathToYour.DB in Edit | Options | Database | Database Connection | Connection URL
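
Before pasting a connection URL into CT, it can be worth checking that the file behind it really is a readable SQLite database and seeing which tables it contains. The following is a minimal sketch in Python using only the standard library; the helper names `path_from_jdbc_url` and `list_tables` and the demo table `en_nl` are hypothetical, and the script only inspects the file, it does not talk to CafeTran.

```python
import os
import sqlite3
import tempfile

JDBC_PREFIX = "jdbc:sqlite:"


def path_from_jdbc_url(url):
    """Strip the jdbc:sqlite: prefix and return the file path behind the URL."""
    if not url.startswith(JDBC_PREFIX):
        raise ValueError(f"not a jdbc:sqlite URL: {url}")
    return url[len(JDBC_PREFIX):]


def list_tables(url):
    """Open the database behind the URL and return its table names, sorted."""
    path = path_from_jdbc_url(url)
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Demo with a throwaway database (hypothetical table name).
demo_db = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(demo_db)
conn.execute("CREATE TABLE en_nl (source TEXT, target TEXT)")
conn.commit()
conn.close()

tables = list_tables(JDBC_PREFIX + demo_db)
```

For a real database, replace the demo URL with the same `jdbc:sqlite:PathToYour.DB` string you intend to enter in the Connection URL field.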

H.


7. You can recall more than 1 table...

... and assign different priorities to the resulting TMs. I don't know if you can do that with fields (rather than tables), like context General and context Automotive, for example.


H.

8. To search for phrases in a table, use quotes.


H.
