
Use regex to harvest multi-word terms

Hello Kevin,

Can you define regexes, or does your platform already offer prebuilt ones, to harvest multi-word glossary candidates like:

The old man
I for one
a brand-new day

See the link in my posting about regex in the Tips&Tricks subforum.

Cheers,
H

Can one define...
The old man/I for one/a brand-new day

Hi Hans,

You mean like the "new" feature in DVX3 (which CafeTran has already had for quite some time)?


(1) ("Déjà Vu X3 now supports using Regular Expressions in the Search and Replace dialogs.") +

(2) ("Extract terminology with Déjà Vu X3 using Regular Expressions")

Hi Hans,

TM-Town currently harvests multi-word terms by searching for n-grams (bi-grams, tri-grams, 4-grams), extracting those that occur at a high frequency, and filtering out uniques (i.e. removing a bi-gram that is part of a larger tri-gram).
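For readers following along, the n-gram approach described above can be sketched roughly as follows. This is only an illustration, not TM-Town's actual code: the function names, the whitespace tokenizer, and the `min_freq` threshold are all assumptions.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def harvest_terms(text, min_freq=2, sizes=(2, 3, 4)):
    """Hypothetical helper: frequent n-grams, minus those nested in longer ones."""
    # Assumption: naive lower-casing and whitespace tokenization.
    tokens = text.lower().split()
    counts = Counter()
    for n in sizes:
        counts.update(extract_ngrams(tokens, n))
    # Keep only n-grams that occur at least `min_freq` times.
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    # Drop an n-gram that is part of a longer frequent n-gram.
    return {
        g: c
        for g, c in frequent.items()
        if not any(len(h) > len(g) and contains(h, g) for h in frequent)
    }


terms = harvest_terms(
    "translation memory system works translation memory system helps"
)
print(terms)
```

In this toy input the bi-grams "translation memory" and "memory system" are dropped because they are part of the frequent tri-gram "translation memory system".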

I'd be interested to learn more about how you would use a regex. One concern I have with a regex solution is that it is difficult to make a language-independent solution. In other words, it might work fine for European languages, but it would probably fail badly for Asian languages.

TM-Town can still get much, much better at extracting multi-word terms, so I'm open to any and all ideas.


Hello Kevin,

The solution you describe has already been offered in many other tools. None of them succeeds in producing anything but a lot of noise, with occasionally a gem.

Language-independent approaches won't be possible. But consider it from the user's perspective instead of from the developer's perspective: why should a French translator be interested in a solution for Russian? I think that with the community it should be possible to define some very productive regexes for the most frequently used languages in the CAT world. No need to cover all n languages of the world.

Why, for instance, not use the fact that German nouns start with a capital letter? Why throw away this valuable linguistic marker just because other languages don't use it? Why not use the sheer mathematical linearity of adjective conjugations to find that multi-word compounds actually belong to the same nest? Why not use MT, perhaps even into several languages simultaneously, to isolate the noun compounds, verb compounds, etc.?

Cheers,
Hans
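As a rough illustration of the capital-letter idea for German, a regex could look for one or more lower-case words with a typical adjective ending followed by a capitalised word. This is only a sketch under loud assumptions: the ending list, the minimum stem length, the tiny determiner list, and the name `german_candidates` are all made up for the example, and the pattern will still let noise through (inflected verbs, for instance).

```python
import re

# Assumption: an "adjective" is a lower-case word of 2+ letters plus one of
# the endings -e, -er, -es, -en, -em. This also matches some verbs and pronouns.
ADJ = r"[a-zäöüß]{2,}(?:e|er|es|en|em)"
# Assumption: a "noun" is a capitalised word, optionally hyphenated.
NOUN = r"[A-ZÄÖÜ][a-zäöüß]+(?:-[A-ZÄÖÜ][a-zäöüß]+)*"
CANDIDATE = re.compile(rf"\b(?:{ADJ}\s+)+{NOUN}\b")

# Tiny illustrative list of determiners; a real list would be much longer.
DETERMINERS = {"eine", "einen", "einem", "einer", "eines",
               "diese", "diesen", "diesem", "dieser", "dieses"}


def german_candidates(text):
    """Return adjective + noun candidates, with leading determiners stripped."""
    results = []
    for match in CANDIDATE.finditer(text):
        words = match.group(0).split()
        while words and words[0] in DETERMINERS:
            words.pop(0)
        if len(words) >= 2:  # keep multi-word candidates only
            results.append(" ".join(words))
    return results


sentence = ("Die automatische Übersetzung nutzt ein großes zweisprachiges "
            "Wörterbuch und eine neue Datenbank.")
print(german_candidates(sentence))
```

Requiring at least one adjective before the noun keeps the candidates multi-word; sentence-initial capitals and capitalised non-nouns remain a source of false hits.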


Hello Hans,

Thanks, all great points! I definitely plan to spend more time soon on improving the term extraction in TM-Town. I've already made some adjustments over the past few days based on your email feedback.

If you are willing, it would be great if you could send me a small sample doc in your language of choice (maybe German) and a second document with the terms that you would like/expect to see extracted. This way I have a base to test against.

I know you are busy, so no worries if you can't, but much appreciated if possible.
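A test base like that (one list of extracted terms, one list of expected terms) could be scored with something as simple as the sketch below. The function name `score_extraction` and the case-insensitive exact matching are assumptions for illustration, not an existing TM-Town feature.

```python
def score_extraction(extracted, expected):
    """Compare extracted terms with the expected list (hypothetical helper).

    Returns (precision, recall, missed), matching terms case-insensitively.
    """
    extracted_set = {term.lower() for term in extracted}
    expected_set = {term.lower() for term in expected}
    hits = extracted_set & expected_set
    precision = len(hits) / len(extracted_set) if extracted_set else 0.0
    recall = len(hits) / len(expected_set) if expected_set else 0.0
    missed = sorted(expected_set - extracted_set)
    return precision, recall, missed


precision, recall, missed = score_extraction(
    ["neue Datenbank", "nutzt ein"],
    ["neue Datenbank", "großes Wörterbuch"],
)
print(precision, recall, missed)
```

Low precision corresponds to the "lot of noise" problem, low recall to missed gems.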



I am in the iPhone so please don't mind the sloppy formatting.

You could also integrate the numerous linguistic treasure chests like Woxikon and kanoo to filter the text (what's already known can be ignored; a bold assumption that has to be checked). E.g. the word nest for 'gehen' (to go):

Woxikon / Verbs / German / G / gehen

GERMAN CONJUGATION OF GEHEN
Total verb forms: 77

IMPERATIVES AND PARTICIPLES
Partizip I: gehend
Partizip II: gegangen
Imperativ (Du): geh(e)
Imperativ (Wir): gehen
Imperativ (Ihr): geht

TYPE | ich | du | er/sie/es | wir | ihr | sie
Präsens Indikativ | gehe | gehst | geht | gehen | geht | gehen
Präteritum Indikativ | ging | gingst | ging | gingen | gingt | gingen
Futur I Indikativ | werde gehen | wirst gehen | wird gehen | werden gehen | werdet gehen | werden gehen
Futur I Konjunktiv II | würde gehen | würdest gehen | würde gehen | würden gehen | würdet gehen | würden gehen
Präsens Konjunktiv I | gehe | gehest | gehe | gehen | gehet | gehen
Präteritum Konjunktiv II | ginge | gingest | ginge | gingen | ginget | gingen
Perfekt Indikativ | bin gegangen | bist gegangen | ist gegangen | sind gegangen | seid gegangen | sind gegangen
Plusquamperfekt Indikativ | war gegangen | warst gegangen | war gegangen | waren gegangen | wart gegangen | waren gegangen
Futur II Indikativ | werde gegangen sein | wirst gegangen sein | wird gegangen sein | werden gegangen sein | werdet gegangen sein | werden gegangen sein
Futur II Konjunktiv II | würde gegangen sein | würdest gegangen sein | würde gegangen sein | würden gegangen sein | würdet gegangen sein | würden gegangen sein
Perfekt Konjunktiv I | sei gegangen | sei(e)st gegangen | sei gegangen | seien gegangen | seiet gegangen | seien gegangen
Plusquamperfekt Konjunktiv II | wäre gegangen | wär(e)st gegangen | wäre gegangen | wären gegangen | wär(e)t gegangen | wären gegangen

VERBS SIMILAR TO GEHEN: geien, gießen
CONJUGATED VERBS BEFORE AND AFTER GEHEN: gegenüberstellen, gegenübertreten, gehaben, geheimhalten, geheimtun, « gehen », gehenlassen, gehorchen, gehren, gehören, geien

So how can these man-years of valuable linguistic information be used for term spotting? That's an interesting question for a master's thesis.

About your request: I'll ask the French department.
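One way such a word nest might be put to work is sketched below: map every known form to its lemma, then ignore tokens that are already known. Everything here is an assumption for illustration: the nest is a hand-typed excerpt of the 'gehen' forms, not data loaded from Woxikon or any other resource, and the helper names are invented.

```python
import re

# Hand-typed excerpt of the 'gehen' word nest (assumption: a real system
# would load complete nests from a linguistic resource).
NESTS = {
    "gehen": {"gehe", "gehst", "geht", "gehen", "gehest", "gehet",
              "ging", "gingst", "gingen", "gingt", "ginge", "gingest",
              "ginget", "gegangen", "gehend", "geh"},
}


def build_lookup(nests):
    """Map each inflected form to the lemma (head word) of its nest."""
    return {form: lemma for lemma, forms in nests.items() for form in forms}


def lemma_of(token, lookup):
    """Return the lemma for a known form, or None if the token is unknown."""
    return lookup.get(token.lower())


def unknown_tokens(text, lookup):
    """Keep only tokens not covered by any nest (the 'known can be ignored' idea)."""
    tokens = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    return [t for t in tokens if t.lower() not in lookup]


lookup = build_lookup(NESTS)
print(lemma_of("gegangen", lookup))
print(unknown_tokens("Wir gingen gestern ins Kino", lookup))
```

Whether ignoring known forms really helps term spotting is exactly the bold assumption that would have to be checked.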
Actually ON the iPhone ;)