
What happens to the structure of CafeTran TXT glossaries when imported to TM-Town "Terminology files"?

Hi Igor/Kevin,


I am playing around with uploading several of my CafeTran (CT) TXT glossaries to TM-Town to see how (Lucene) matching works, and realised I don't understand how the structure of my CT glossary is used to create the online TM-Town version.


The header of my (tab-delimited) TXT glossaries currently has:


#nl-NL [TAB] #en-GB [TAB] #Context [TAB] #Subject [TAB] #Client [TAB] #Note [TAB] #Sense [TAB] #Usage example [TAB] #Source [TAB] #URL


Synonyms in the first two columns (my source and target languages) are separated by a semicolon. Do these synonyms get split into separate synonyms in the TM-Town glossary? I checked, and I don't think so.
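
To make the layout concrete, here is a minimal Python sketch of reading a glossary with the header above and splitting the first two columns on semicolons. The sample row and the helper names (`read_glossary`, `split_synonyms`) are made up for illustration; this is not how CafeTran or TM-Town actually parse the file.

```python
import csv
import io

# Hypothetical sample: the header described above plus one invented entry.
SAMPLE = (
    "#nl-NL\t#en-GB\t#Context\t#Subject\t#Client\t#Note\t#Sense\t#Usage example\t#Source\t#URL\n"
    "auto;wagen\tcar;automobile\t\tautomotive\t\t\t\t\t\t\n"
)


def read_glossary(text):
    """Return one dict per entry, keyed by header name without the leading '#'."""
    rows = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(rows)]
    entries = []
    for row in rows:
        # Pad short rows so that every header name has a value.
        row = row + [""] * (len(header) - len(row))
        entries.append(dict(zip(header, row)))
    return entries


def split_synonyms(cell):
    """Split a source or target cell on semicolons, dropping empty parts."""
    return [part.strip() for part in cell.split(";") if part.strip()]


entries = read_glossary(SAMPLE)
first = entries[0]
print(split_synonyms(first["nl-NL"]))  # ['auto', 'wagen']
print(split_synonyms(first["en-GB"]))  # ['car', 'automobile']
print(first["Subject"])                # automotive
```
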


Also, and following Kwang's recent comment elsewhere (pasted below), what happens to regular expressions, pipe characters, etc.?


Michael


Kwang's comment:


Hi Igor,


What about the semi-colon characters and pipe characters in the glossary (both at the source and target sides)?

If I upload my glossaries to TM-Town, apart from the Lucene thing, will it work the same way as CT (e.g. giving/displaying matches, auto-assembling, regex.. etc.)?


Kwang


Hi Kevin,


Regarding the pipe character: it is placed in front of an entry to mark that entry as a regex.

Please note that there is also a regex txt glossary (i.e. all entries are regex) where such specification is not needed.

Inside a regex entry, pipe characters also represent alternation, just as in normal regex.


Another interesting point to mention is Igor's method of handling numbers in a regex entry. For example:


|(\d+) mm. [TAB] 0 มิลลิเมตร

(in the case of an EN-TH glossary :))


That zero represents the corresponding number which will be inserted in the target pane by auto-assembling.
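
For anyone unfamiliar with the idea, here is a rough Python sketch of the behaviour described above: match the regex entry against the source segment, then put the matched number where the zero sits in the target. The function name `apply_regex_entry` and the sample segment are invented, and the substitution rule (first run of digits replaces the first "0") is only an assumption about what happens, not CafeTran's actual code.

```python
import re


def apply_regex_entry(source_entry, target_template, segment):
    """Return the target with the matched number inserted, or None if no match."""
    # The leading pipe only flags the entry as a regex; it is not part of the pattern.
    pattern = source_entry[1:] if source_entry.startswith("|") else source_entry
    match = re.search(pattern, segment)
    if match is None:
        return None
    # Assumption: the first run of digits in the match stands in for the "0".
    number = re.search(r"\d+", match.group(0))
    if number is None:
        return target_template
    return target_template.replace("0", number.group(0), 1)


print(apply_regex_entry(r"|(\d+) mm.", "0 มิลลิเมตร", "Use a 25 mm. bolt"))
# 25 มิลลิเมตร
```
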


Kwang



Hi Kevin,


There are two options for TXT glossaries:


(1) The whole glossary is regex-ready, so to speak, which can be set in the glossary settings.



(2) Individual entries have a regex in them. In this case, each regex entry must be preceded by a pipe character (to tell CafeTran to interpret the entry as such), thus:


|\d+ autos [TAB] 0 cars


and in case of synonyms:


|\d+ autos;|\d+ auto's;|\d+ auto’s [TAB] 0 cars


CafeTran knows where the regex ends because that is where the entry ends. If there are several synonyms, it ends at the semicolon. If there are no synonyms, it ends at the tab.
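
As an illustration for Kevin, the rule above (regex ends at the semicolon or at the tab) could be sketched in Python roughly like this. `parse_entry` is a made-up name, and the sketch naively assumes a regex never contains a literal semicolon; how CafeTran handles that case is not covered here.

```python
def parse_entry(line):
    """Split one glossary line into (source alternatives, target text).

    Each alternative is a (text, is_regex) pair. A leading pipe marks a
    regex and is stripped; pipes elsewhere in the text are left alone
    because they are ordinary regex alternation.
    """
    source, _, target = line.partition("\t")
    alternatives = []
    for part in source.split(";"):  # naive: assumes no ';' inside a regex
        is_regex = part.startswith("|")
        alternatives.append((part[1:] if is_regex else part, is_regex))
    return alternatives, target


line = "|\\d+ autos;|\\d+ auto's;|\\d+ auto’s\t0 cars"
alternatives, target = parse_entry(line)
for text, is_regex in alternatives:
    print(repr(text), is_regex)
print(repr(target))
```
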


In reply to your question, "Is the regex always in parentheses after the pipe character?": no.


I am no "regexpert", but I always write "\d+" (i.e. without the parentheses), whereas I noticed that Kwang wrote his with parentheses. Maybe Igor could explain this...



Thank you Igor for your explanation!

I wrote them with parentheses for visual convenience at first, saw that it worked, and have written them that way ever since. :D

Hi Igor, 


I've already asked this once, but I will ask it again as I think it is quite important: can we please have a subforum for regular expressions? It would be great to have one clear place to put all regex-related chat!


Michael

Hi Kwang,


Thank you for the explanation!


So the pipe character represents the start of the regex. How does CafeTran know the end of the regex? Is the regex always in parentheses after the pipe character?

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o" and "g".


Source: Capturing groups


There is no need to use parentheses as Kwang did to catch digits.
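
A quick way to see the difference is to try both forms in a regex engine. The snippet below uses Python's `re` module purely for illustration (CafeTran's own engine may differ in details): both patterns match the same text, and the parentheses only add a numbered group.

```python
import re

text = "12 autos"

without_group = re.search(r"\d+ autos", text)
with_group = re.search(r"(\d+) autos", text)

# Both patterns match exactly the same stretch of text.
print(without_group.group(0))  # 12 autos
print(with_group.group(0))     # 12 autos

# Only the parenthesised version exposes the digits as group 1.
print(with_group.group(1))     # 12
print(without_group.groups())  # ()
```
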


Igor

Hi Michael,


Thanks for starting this thread. I still have a lot of work to do in this area and would classify it as still under development. The current implementation is very simple and lacking.


Currently, TM-Town only extracts the source and target term. Now that I have started expanding the information a TM-Town term can hold (i.e. definitions, contexts, example sentences), I need to circle back to this, map the CafeTran columns to those fields and improve the importer. So this is on the to-do list.


Synonyms in the first two columns (my source and target languages) are separated by a semicolon. Do these synonyms get split into separate synonyms in the TM-Town glossary? I checked, and I don't think so.


Currently this is not split but kept as one entry. I could split it if you like. I think one advantage to keeping it together is that you will be able to see the synonym when you do a search. For example:

"happy;elated"


With the Lucene search, if you search for the term "happy", you would get back this entry and would see the synonym. If the terms were split into different entries, you wouldn't see "elated" when you search for "happy".
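
To illustrate the trade-off, here is a toy Python sketch. It is not Lucene and not TM-Town's code; it only assumes that the search breaks terms into word tokens at punctuation such as the semicolon. The names `tokenize` and `search` are made up.

```python
import re


def tokenize(term):
    """Lower-case a term and split it into word tokens at non-word characters."""
    return [token for token in re.split(r"\W+", term.lower()) if token]


def search(entries, query):
    """Return every entry that contains the query as one of its tokens."""
    wanted = query.lower()
    return [entry for entry in entries if wanted in tokenize(entry)]


kept_together = ["happy;elated"]
split_apart = ["happy", "elated"]

print(search(kept_together, "happy"))  # ['happy;elated']
print(search(split_apart, "happy"))    # ['happy']
```

With the combined entry the synonym comes back alongside the hit; with split entries it does not.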


Definitely looking for your feedback here though. Let me know how you use the synonyms and I can try to design it to best fit your needs.


what happens to regular expressions, pipe characters, etc.?


Currently each term is just imported as plain text, so nothing. I'm a CafeTran newbie, so I am still learning. It would be helpful to hear some examples of how you use regular expressions (baked into the term file?) and pipe characters in your workflow and this will help me to understand what the best solution should be.

Best,
Kevin



Roger that, captain!
