
memories in TMX or tab delimited?

Hello all, 


I would like to know what you would recommend, from your experience, for creating and using glossaries (or memories for terms).


What would you recommend: creating TMs for segments and then adding terms to the glossary, or adding a new memory for terms in TMX?


I am confused: 


On the one hand, I read in CafeTran Help that "We strongly encourage you to use tab-delimited text glossaries for storing your terminology, since they have many advantages compared with TMX files. They can be sorted and edited very easily, using a text editor, a spreadsheet program or an advanced editor for tab-delimited files",


but on another page I find a section about Reasons for using TMX files for terms which reads: 


"Since you can open TMX files as normal projects, you can perform some actions on your terminology that are not possible with the tab-delimited format"


Which option is better for making the best use of these tools when translating?

Thank you!



That's OK, there are so many messages to answer in one day, you sometimes get a bit lost in them all!

Thanks so much, Michael. I am sorry I did not send a reply to your message.

https://cafetran.freshdesk.com/support/discussions/topics/6000007800?page=1 (sorry, forgot to make that link clickable in the post editor)

Hi Elena (sorry, I called you Alejandro before, but that's what your forum name says),


There is also a bit more info on the structure of using TMXs to store terms here:


https://cafetran.freshdesk.com/support/discussions/topics/6000007800?page=1 (Hans and I are currently discussing it)


Michael

Thank you all for your detailed and helpful answers. I really appreciate them. 

It is clear that I have to figure out which method will work best for me. I am now working on building up an extensive glossary of technical terms.


Thanks all and have a great Friday!

Elena


Hi Alejandro, 


I just found something I wrote recently in the [help_] Yahoo list (which I thought might be interesting or useful to you):


Hi Max,

It is indeed extremely exciting, and lots of fun. Working with my big CafeTran glossary is a joy. Similar to the regex Hans mentioned, I added around 25 different ones in my job last night, pieces of a solar power system manual. 

My main glossary is a tab-delimited UTF-8 text file, which I choose to save as ".csv" file instead of the default ".txt", so it opens in Ron's CSV Editor (HIGHLY recommended to anyone on Windows who works with tab-delimited terminology files!) when double-clicked, or when I right-click and select "Edit glossary" from inside CafeTran when translating. The reason Ron's editor (or a decent UTF-8 aware CSV editor) is so great is that it allows you to see the contents of your tab-del file as if it were in Excel, that is, in nice visually clear rows & columns (this is invaluable, imo) and perform multi-level filtering: i.e., first filter on all entries with a certain Client field, and then narrow this down to all entries containing a regular expression. VERY handy for TB maintenance tasks!

Here are a few regexes I added yesterday:

.+? = any string
\d+ = a number
.? = any character, once or not at all
----------------------------------
#nl-NL [TAB] #en-GB
|ongeveer €\d+ [TAB] approximately €000
|conform artikel \d+ [TAB] in accordance with Article 00
|max. vermogen gedurende \d+ minuten [TAB] max. power for 00 minutes
|het uitgangsvermogen van de .+? aanpassen [TAB] adjust the output power of the XXX
|Kan de .+? worden gebruikt in België? [TAB] Can the XXX be used in Belgium?
|Is het verplicht een .+? te plaatsen [TAB] Is it mandatory to install a XXX?
|Is het verplicht om een .+? te plaatsen [TAB] Is it mandatory to install a XXX?
|Werkt het .+? als .+? [TAB] Can the XXX be used as a XXX?
|\d+,\d+ kWh [TAB] 00,000 kWh
|batterijspanning van \d+ V [TAB] battery voltage of 00 V
|\d+.?\d+ kWh [TAB] 0.0 kWh
|\d+ W [TAB] 000 W
|controleren of .+? is inbegrepen [TAB] check whether XXX is included
|Kan ik de .+? rechtstreeks aankopen bij .+? [TAB] Can I purchase the XXX directly from XXX?
|Werkt mijn .+? in back-up modus? [TAB] will my XXX work in backup mode?
|ongeveer €\d+ [TAB] approximately €600
|wanneer ik geen .+? heb [TAB] if I don’t have a XXX
|\d+u [TAB] 0 hours
|\d+ kW [TAB] 00 kW
|Kan ik een .+? gebruiken [TAB] Can I use an XXX?
|indien u de .+? in back-up modus wilt gebruiken [TAB] if you wish to use the XXX in backup mode
|\d+kVA [TAB] 5kVA

Not entirely sure about spaces between stuff like kVA, kW and numbers, but you get the picture! These little gems can save you a lot of work, not to mention it's just fun when you create one and you get to see it work in your next job ;)
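
To make it easier to see what these entries do, here is a minimal Python sketch that applies three of the source patterns above to a sample sentence. It is only an illustration: the leading "|" from the glossary lines is left out, the sample sentence and the names ENTRIES and find_matches are made up, and CafeTran's own matching may differ in its details.

```python
import re

# Three entries copied from the list above (without the leading "|"):
# (source pattern, target template). The list name is hypothetical.
ENTRIES = [
    (r"batterijspanning van \d+ V", "battery voltage of 00 V"),
    (r"het uitgangsvermogen van de .+? aanpassen", "adjust the output power of the XXX"),
    (r"\d+ kW", "00 kW"),
]


def find_matches(segment):
    """Return (matched source text, target template) for every pattern that hits."""
    hits = []
    for pattern, target in ENTRIES:
        for match in re.finditer(pattern, segment):
            hits.append((match.group(0), target))
    return hits


# Made-up source segment for the demonstration.
segment = "Controleer de batterijspanning van 48 V en het vermogen van 5 kW."
for source_text, target in find_matches(segment):
    print(f"{source_text!r} -> {target!r}")
```

The target side is a fixed template, so the translator still replaces the 00 or XXX placeholder by hand; the sketch does not try to carry the number across.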

My glossary file consists of the following 10 columns:

#nl-NL [TAB] #en-GB [TAB] #Context [TAB] #Subject [TAB] #Client [TAB] #Note [TAB] #Sense [TAB] #Usage example [TAB] #Source [TAB] #URL

Note that this is just my own personal preference. CafeTran lets you create as many (or as few) fields as you want, so you can create a full-blown termbase system, or you can just use a basic src/trgt, or src/trgt/subject/client structure. Whatever floats your boat: Igor hasn't made anything mandatory. There are a few defaults that you need to stick to (I think Context needs to come third), but basically the rest is up to you.
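
Because the glossary is plain tab-delimited text with a header row, it can also be read by a few lines of script. The following is a rough sketch, assuming a header line whose names start with "#" as shown above; the sample rows and the function name read_glossary are invented for the example.

```python
import csv
import io

# Made-up sample rows; only the header layout follows the post.
SAMPLE = (
    "#nl-NL\t#en-GB\t#Context\t#Subject\t#Client\n"
    "omvormer\tinverter\t\tSOLAR\tACME\n"
    "accu;batterij\tbattery\t\tSOLAR\t\n"
)


def read_glossary(text):
    """Parse tab-delimited glossary text into a list of dicts keyed by field name."""
    # QUOTE_NONE keeps quotation marks inside terms as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    rows = []
    for fields in reader:
        # Pad short lines so every row has a value for every header field.
        fields = fields + [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


rows = read_glossary(SAMPLE)
solar_terms = [row["nl-NL"] for row in rows if row["Subject"] == "SOLAR"]
print(solar_terms)
```

Filtering on a field such as Subject or Client, as in the last two lines, is the same kind of maintenance task that a CSV editor handles interactively.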

Hans van den Broek hates these files, but there is no need to use them if you don't want to and the system they are built on has absolutely no effect on the rest of CT. Let me stress this: all the new features I requested re: CT's tab-delimited txt glossaries, and which Igor implemented, none of these impact the system as a whole in any way. All the pipe characters and regexes, and synonyms, etc: if you don't like them, just don't use them. Hell, if you don't go looking for them, you wouldn't even know they were there. Hans just doesn't like me and likes to complain, so he tries to make it look like I ruined something, whereas I have actually contributed a large portion of the new ideas over the last year or two. The fact of the matter is CT is just getting better and better, period. 

Tab-dels (as they are sometimes referred to by CT users) are just another format to store terminology in, in CT. You can also store your terms in TMXs, which CT calls "TMX termbases", or "Memories for Terms" in the older lingo. Obviously, storing terms in TMX files has all manner of limitations (the main one being it can't properly handle synonyms, or bundled entries covering a single concept), but they do have the benefit of allowing you to access the fuzzy matching functionality of the translation memories (which has limited value, if you ask me, but then I'm no expert in this area as I generally don't store any terms in TMXs, only project segments).

In my system:

#nl-NL (source term; can contain synonyms; just separate with a ";")
#en-GB (target term; can contain synonyms; just separate with a ";")
#Context ("Contextual Priority" aka "Context-aware Auto-assembling" (C-3A); see: http://cafetran.wikidot.com/using-context-aware-auto-assembling)
#Subject (can be used to give the entry priority in the auto-assembling system) 
#Client (can be used to give the entry priority in the auto-assembling system)
#Note (I use this for various purposes) 
#Sense (I use this to distinguish terms from one another)
#Usage example (self-explanatory)
#Source (where you found the term; I use this column to manage the glossary as a whole later in Ron's CSV Editor)
#URL (clickable hyperlink. e.g., to quickly take you to a relevant Wikipedia article, or even file on your computer, when translating)

These are just mine. You can of course come up with your own, better system!

Also note that it is not necessary to have all these fields shown in the glossary pane, you can hide any of them you want. See e.g.:

"It is now possible to hide specific metadata fields in the Glossary pane. This can be useful to keep the Glossary pane from getting too messy if you tend to enter a lot of metadata. The fields you wish to hide can now be defined by comma separated numbers. Go to: Edit > Options > Glossary > Fields to hide, and enter a number for each field you wish to hide. Imagine your Glossary has 10 fields (#nl-NL –– #en-GB –– #Context –– #Subject –– #Client –– #Note –– #Sense –– #Usage example –– #Source –– #URL), and you want to display only the ‘sense’ field (and of course the source and target term). You would therefore enter: ‘1,2,3,4,6,7,8’. Only the source, target, and the sense field will now be displayed in your Glossary pane." (http://cafetranhelp.com/changelog)

E.g., you might enter lengthy definitions, but not want them to clog up your glossary pane: just hide them. You will still be able to see them by merely hovering over the relevant term in the glossary pane.
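
The numbering in the changelog example only adds up if the numbers count the metadata fields, starting at 1 for the first field after the source and target term (so 5 is Sense). That reading is an inference from the example, not something the changelog states. Under that assumption, the effect of the setting can be sketched like this; the function name visible_fields is hypothetical.

```python
FIELDS = ["nl-NL", "en-GB", "Context", "Subject", "Client", "Note",
          "Sense", "Usage example", "Source", "URL"]


def visible_fields(fields, hide_setting):
    """Return the fields that stay visible for a comma-separated hide setting.

    Assumption: the numbers count only the metadata fields, with 1 being
    the first field after the source and target term.
    """
    hidden = {int(number) for number in hide_setting.split(",") if number.strip()}
    source_target, metadata = fields[:2], fields[2:]
    shown = [name for index, name in enumerate(metadata, start=1)
             if index not in hidden]
    return source_target + shown


print(visible_fields(FIELDS, "1,2,3,4,6,7,8"))
```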

CafeTran's terminology system really is the most powerful and flexible of any CAT tool, imo. And yes, in terms of features that actually matter to us translators, I would say it even beats MultiTerm and Transit, often lauded as having very powerful terminology systems.

#########################


And all this is only the TERMINOLOGY FUNCTIONALITY of CafeTran. I haven't even mentioned the "Total Recall" system, which recently got a major upgrade allowing us to use the same SQLite db format that Farkas András uses in his amazing tool TMLookup (http://www.farkastranslations.com/tmlookup.php). The most important effects of this are:

(1) the Total Recall system is now much, much faster. So fast, in fact, that I can now pre-translate my entire project against my massive collection of TMXs (including all the DGT-TMs ever released for NL/EN, all the CELEX stuff from http://www.farkastranslations.com/eu_translation_memories.php, all my own TMs from over the years, all kinds of stuff from the old Opus site (now dead?) opus.lingfil.uu.se/, + the new site @ http://datahub.io/dataset/opus, and god knows how many TMs downloaded from the TAUS Data site) ... in under a minute. I actually still have my entire collection of TMXs inside memoQ, and trying something similar would take hours and hours in memoQ (I have no idea how long exactly, because I have never actually let it finish). My massive collection of TMXs actually contains around 45 million TUs and the TMLookup SQLite db file is around 25 GB on disk! CT can search through this WHOLE thing, looking for possibly useful matches in your current project, in under a minute!

(2) You can now use your default TMLookup DB file (default.db) inside CafeTran's Total Recall system without having to change or edit it at all! What's more, SQLite DBs allow concurrent lookups, so you don't even have to close TMLookup!

#########################

Hope all this made some sense! I really hate writing long posts in this silly Yahoo Groups interface. It really is a piece of &^%$. I actually wrote the whole thing in a text editor and switched the Yahoo interface to plain text, but even then the stupid system will probably double all my line endings and otherwise garble my message. Can't wait for the new CafeTran forum there has been talk about recently: an actual forum where you can write in peace and present people with a clear message.

Michael


@ https://groups.yahoo.com/neo/groups/help_/conversations/topics/46829;_ylc=X3oDMTJyMW9mdDJkBF9TAzk3MzU5NzE1BGdycElkAzI4MzEwNzcEZ3Jwc3BJZAMxNzA1MTE1Mzg2BG1zZ0lkAzQ2ODI5BHNlYwNkbXNnBHNsawN2bXNnBHN0aW1lAzE0MzUxOTc0NzA-



Hmm, well, I usually move individual alternative terms around a lot, manually (using copy/paste) in the Quick term editor


I group sets of alternative terms (which, together, form one concept) (I’m not going to use the word "synonyms", because they aren’t always strictly speaking synonyms) in one entry (i.e., on one line of the tab-delimited text file), separated by semicolons, thus:


cat;alternative cat 1;alternative cat 2 [TAB] kat;alternative kat 1;alternative kat 2 


If I find that I want "alternative kat 2" to be used in AA, rather than "kat", I simply click on the entry and in the Quick term editor copy/paste and move it to the front, so next time the alternative term I stuck in front will be used.
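
For anyone who prefers to do this reordering in bulk outside the Quick term editor, the same move-to-front step can be sketched in a few lines of Python. This is only an illustration on the example line above; the function name promote is made up, and it assumes the first alternative in the target field is the one that gets used.

```python
def promote(term_field, preferred, sep=";"):
    """Move one alternative to the front of a semicolon-separated term field."""
    alternatives = [term.strip() for term in term_field.split(sep)]
    if preferred not in alternatives:
        raise ValueError(f"{preferred!r} is not one of the alternatives")
    alternatives.remove(preferred)
    return sep.join([preferred] + alternatives)


line = "cat;alternative cat 1;alternative cat 2\tkat;alternative kat 1;alternative kat 2"
source, target = line.split("\t")
new_line = source + "\t" + promote(target, "alternative kat 2")
print(new_line)
```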


However, there is also another way: you create two separate entries (separate lines, in the tab-delimited text file), thus:


cat [TAB] kat 


and


cat [TAB] poes


and then differentiate between the two using either the Client or Subject field (or both), thus:


-----------------------------

#nl-NL [TAB] #en-GB [TAB] #Subject 

-----------------------------


cat [TAB] kat [TAB] FINANCIAL 


and


cat [TAB] poes [TAB] LOGISTICS

---------------------------------------


Then, in financial texts, "cat" will be translated as "kat", and in logistics texts, it will be translated as "poes". Thus achieving the same effect as using multiple TMXs for terms, for different clients/subjects, etc., as Hans van den Broek does.
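
The idea behind the Subject field can be shown with a small Python sketch. It is a simplification, not CafeTran's actual priority logic (which is not described here): an entry whose subject matches the project's subject wins, and otherwise the first entry for the term is used. The names ENTRIES and pick_target are hypothetical.

```python
# The two example entries from above, with their Subject field.
ENTRIES = [
    {"source": "cat", "target": "kat", "subject": "FINANCIAL"},
    {"source": "cat", "target": "poes", "subject": "LOGISTICS"},
]


def pick_target(source_term, project_subject, entries=ENTRIES):
    """Prefer the entry whose subject matches the project; else take the first."""
    candidates = [entry for entry in entries if entry["source"] == source_term]
    for entry in candidates:
        if entry["subject"] == project_subject:
            return entry["target"]
    # No subject match: fall back to the first entry, or None if the term is unknown.
    return candidates[0]["target"] if candidates else None


print(pick_target("cat", "FINANCIAL"))
print(pick_target("cat", "LOGISTICS"))
```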


~~~~~~~~~~~~~~

You can of course also use a combination of the two methods above.


Oh, I see, Michael. 

Thank you for the tip. 

So, what is your technique for optimizing your glossary then? Different entries with the alternative translations and giving context?


Gracias!

Agree with Hans, but would just like to mention that the function Library > Glossary > Merge alternative translations should only be used with very basic Glossaries. The command would completely destroy mine, for example ;)


See e.g.:


✪ **WARNING: the command Library > Glossary > Merge alternative translations mentioned below only applies to one type of tab-delimited Glossary and can destroy your Glossary if you make use of fields for storing extra info about your terms like: Subject, Client, Note, Sense, Usage example, Source and URL. Optimising your Glossary with this command can then inadvertently delete a lot of this data. Please make sure to back up your Glossary before running this command, and check the content of the file carefully after running it to see if anything strange happened.**


@ http://cafetran.wikidot.com/optimising-your-glossaries
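
Making the backup by hand (copying the file in your file manager) is perfectly fine. If you would rather script it, a minimal Python sketch could look like the following; the function name backup_glossary and the timestamped file name are just one possible convention, and the demo runs on a throwaway file in a temporary folder rather than on a real glossary.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_glossary(path):
    """Copy the glossary next to itself under a timestamped name and return the copy."""
    path = Path(path)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.stem}.{stamp}.bak{path.suffix}")
    shutil.copy2(path, backup)  # copy2 also preserves the modification time
    return backup


# Demonstration on a temporary file, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    glossary = Path(tmp) / "glossary.txt"
    glossary.write_text("cat\tkat\n", encoding="utf-8")
    copy = backup_glossary(glossary)
    same = copy.read_text(encoding="utf-8") == glossary.read_text(encoding="utf-8")
    print(copy.name, same)
```

After running the merge command, comparing the backup with the new file in a diff tool is a quick way to check whether any metadata went missing.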

Thank you, Hans, I really appreciate the info!



I'd say, for reasons of simplicity and management: go for plain-text glossaries.


What is your language combination? Do you need stemming?


Let's go through the items that are listed as advantages of TMX for terms:


In the Open Memory dialogue you can set the matching type to Fuzzy. For glossaries you would have to use regular expressions to achieve this.


If you don't need fuzziness, then this is not an advantage of T4T.
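
To give an idea of what fuzziness buys you compared with exact lookup, here is a small Python sketch using the standard difflib module as a stand-in. How CafeTran actually scores fuzzy matches is not described in this thread, so the similarity measure, the 0.8 threshold and the names TERMS and fuzzy_lookup are all assumptions.

```python
from difflib import SequenceMatcher

# Made-up term pairs for the demonstration.
TERMS = {
    "uitgangsvermogen": "output power",
    "batterijspanning": "battery voltage",
}


def fuzzy_lookup(word, terms=TERMS, threshold=0.8):
    """Return (source, target, score) for the most similar term, or None."""
    best = None
    for source, target in terms.items():
        score = SequenceMatcher(None, word.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


# An inflected form still finds its term; an unrelated word does not.
print(fuzzy_lookup("uitgangsvermogens"))
print(fuzzy_lookup("omvormer"))
```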


In the Memory Filter dialogue you can filter on properties or regular expressions, in order to only load term pairs that match with certain properties or regular expressions.


If you don't plan to use these properties, then this isn't an advantage of T4T.


In the QA menu you can perform some useful checks on your terms (Check spelling in target segments, Initial capitalization check, Double words check).


This one stands.


In the Task menu you can use Delete filtered TMX units to remove all term pairs where source or target term match with a filter condition (via the Find and Replace dialogue).


This one too.


You can load the TMX file for terms to a database, which theoretically means that you can store more term pairs when you have a limited amount of RAM available.


I think this is valid for glossaries too.


Note that you can easily convert a glossary to a T4T:
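
CafeTran's own conversion command is not reproduced here. Purely as an illustration of what such a conversion involves outside the tool, the following Python sketch turns tab-delimited term pairs into a minimal TMX 1.4 document. The function name glossary_to_tmx and the header values are assumptions, only the first two columns are used, and semicolon-separated alternatives are copied as they are.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def glossary_to_tmx(lines, src_lang, tgt_lang):
    """Build a minimal TMX 1.4 string from tab-delimited source/target lines."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "glossary_to_tmx sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "phrase",
        "o-tmf": "tab-delimited",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Skip the "#..." header row, blank lines and lines without a target.
        if line.startswith("#") or len(parts) < 2 or not parts[0].strip():
            continue
        unit = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, parts[0]), (tgt_lang, parts[1])):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


sample = ["#nl-NL\t#en-GB", "omvormer\tinverter", "accu;batterij\tbattery"]
print(glossary_to_tmx(sample, "nl-NL", "en-GB"))
```

All the extra fields (Subject, Client, Note and so on) are dropped in this sketch, which is one reason to keep the tab-delimited file as the master copy.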



When you go for glossaries, you can easily add alternative source and target terms. The file format is robust and simple, you can easily edit it with a text editor, and you don't need to bother with any XML tags.


You also get a set of very useful optimisation tasks:



Especially the Merge task is very useful: you can constantly optimise the order of your target terms. And you can overrule it per session too, without changing the glossary (not sure whether the latter is also valid for T4T).


I'd say: when there is no special reason to go for T4T, go for glossaries.

