
Should I use the pipe here?

I know almost nothing about regular expressions, but I would like to use them to make my glossaries as compact as possible. For example, how could I instruct CT to show "depending on" for any of the following source terms?


a seconda di

a seconda dei

a seconda degli


Thanks in advance


Mario: I know almost nothing about regular expressions


An excellent reason for not using them. Regexes are extremely powerful, and therefore extremely dangerous.


That said, are you absolutely sure that a seconda d cannot be followed by anything else than i, ei, or egli (I don't know Italian)?


H.

Thank you, woorden. I forgot "a seconda delle", so four possibilities. But this is just an example. I don't want to learn the whole power of regular expressions, just the basics, in order to make my glossary as neat as I can.

Mario: make my glossary as neat as I can


I have the following objections:

  1. At least in your example above, you'll only "win" a few bites
  2. I'm against the use of tab del glossaries. If your example above had been in a TM for Fragments, CafeTran would recognise "a seconda" in the Automatic workflow
  3. The only reason to use tab del glossaries for me is... regexes. However, they should be in a tab del glossary reserved for entries with regexes only. If not, your glossary will likely not be exchangeable, and you won't be able to use it yourself in another CAT tool.
H.

Grrrr. "bites" should be "bits" or "bytes" of course. Freshdesk posts cannot be edited unless you delete them first.

Did I ever mention that I don't like Freshdesk?


H.

What you could do to save a few bits in your example (and remember I'm not a tab del glossary fan/expert): Create an entry like this:


a seconda di;a seconda dei;a seconda degli;a seconda delle [TAB] depending on


You can't use that in your DV lexicon either, I think, but unlike a glossary with regexes, this can easily be resolved, and I even think CafeTran can do it for you.
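
If it can't, a few lines of script will do it. A minimal sketch in Python (assuming a plain UTF-8 tab-delimited file; the file names are only examples):

# Expand semicolon-separated synonyms into one glossary line per synonym.
# "glossary.txt" and "glossary_expanded.txt" are just example names.
with open("glossary.txt", encoding="utf-8") as infile, \
     open("glossary_expanded.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        line = line.rstrip("\n")
        if not line:
            continue
        source, target = line.split("\t", 1)
        for synonym in source.split(";"):
            outfile.write(f"{synonym.strip()}\t{target}\n")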


H.

For most of my EN > IT work I don't use TMs, because 1) the text isn't very repetitive, if at all, and 2) since the English source text is written by Japanese people, it's plagued by errors, funny expressions and Japanesized English. For technical EN > IT and JP > IT it's a different story, although I tend to use separate TMs instead of the big mama.


My main and most precious tool has always been the glossary, and this is why I am always looking for ways to make it smaller and cleaner, in particular now that I've just started to use CT. Maybe you are right: I should start using a TM for fragments.


Meanwhile I accept your suggestion not to touch regexes.

Mario: 1) the text isn't much repetitive


But because of the fuzziness TMs/TMX files offer (and tab del glossaries don't), TMs are still useful.


I tend to use separate TMs instead of the big mama


I use both, the Big Mama having the lowest possible priority.


My main and most precious tool has always been the glossary


But the times, they are a-changing. The DejaVu lexicon was indispensable, because it overruled everything else (except longer matches from the MDB/TDB), and there was no way to connect to more resources. MDB, TDB, Lexicon, that was it. Nowadays, you can connect to multiple resources, and assign priorities to them. I mentioned that quite recently on Herbert's DejaVu-L list, and I was accused of trying to "convert" them. I don't want to.


Meanwhile I accept your suggestion not to touch regexes.


By all means, "touch" them. They can be very useful. But be aware that they can be dangerous. I actually use them quite a lot, although not in CafeTran, where I usually can't see what they are doing.


H.

HANS: I'm against the use of tab del glossaries.

MICHAEL: Yes, we all know that by now, but it is silly to try to force your particular obsession on new users who have no knowledge of your ancient and by now very tedious personal crusade against tab-delimited glossaries.

 

HANS: The only reason to use tab del glossaries for me is... regexes.

MICHAEL: Nonsense, there are so many other good reasons why they are preferable. But I am not going to open that old can of worms.

 

HANS: However, they should be in a tab del glossary reserved for entries with regexes only. If not, your glossary will likely not be exchangeable, and you won't be able to use it yourself in another CAT tool.

MICHAEL: Also not true. While there is something to be said for keeping all regexed entries in a separate regex-only Glossary … it is also extremely easy to locate all lines in a txt Glossary with regexes and either (1) delete them, or (2) edit them, so the file will be perfectly processable when porting it to another CAT tool's termbase system.
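
For example (just a sketch, not a polished tool; it assumes the regex entries are marked by a leading "|", and the file names are placeholders), a few lines of Python will do the sorting:

# Split a tab-delimited glossary into plain entries and regex entries.
# Regex entries are assumed to start with "|"; file names are placeholders.
with open("glossary.txt", encoding="utf-8") as infile, \
     open("glossary_plain.txt", "w", encoding="utf-8") as plain, \
     open("glossary_regex.txt", "w", encoding="utf-8") as regex_only:
    for line in infile:
        (regex_only if line.startswith("|") else plain).write(line)

The plain file can then be fed straight to the other tool's termbase import.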

 

I agree that it is often simply easier to add the separate entries (which is what I do), and keep them nicely organised in a single Glossary entry (i.e. single line in the txt file) like this:

 

a seconda di;a seconda dei;a seconda degli;a seconda delle [TAB] depending on


~


@Hans: The reason (well, one of them) it is silly for you to push your personal agenda re using TMXs rather than Glossaries for terminology is that you will merely end up confusing people like Mario, rather than helping them. He doesn't yet know that terminology in CafeTran can be handled in two ways:


1. Glossaries (tab-delimited .txt files; or .csv, or .tsv, for that matter, which is what I use) 

+

2. TMXs (with "Fragments memory" switched ON in TMX options)
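
To make the second option concrete (a sketch only; CafeTran's own TMX export will differ in the details, and the header attributes here are placeholders), this is roughly what a minimal fragments TMX looks like when written from a script:

# Write a minimal TMX file holding one terminology fragment.
# Illustrative only; not how CafeTran itself writes its TMX files.
import xml.etree.ElementTree as ET

tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", creationtool="sketch", creationtoolversion="1.0",
              segtype="phrase", adminlang="en", srclang="it",
              datatype="plaintext", **{"o-tmf": "none"})
body = ET.SubElement(tmx, "body")
tu = ET.SubElement(body, "tu")
for lang, text in (("it", "a seconda di"), ("en", "depending on")):
    tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
    ET.SubElement(tuv, "seg").text = text
ET.ElementTree(tmx).write("fragments.tmx", encoding="utf-8", xml_declaration=True)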



So if you are going to push your "Philosophy" on him, and others, please at least explain the basics of CafeTran terminology before you do so.


Michael 

> a seconda d....


The following regular expression should catch all your examples:


|a seconda d.{1,4}?


Note that "|" at the start indicates this is a regular expression. 
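
If you want to check the pattern outside CafeTran first, any regex engine will do. A quick sketch with Python's re module (the leading "|" is only CafeTran's marker, not part of the pattern, so it is left out here; CafeTran's own matching may differ in detail):

import re

# Stand-in test of the pattern against the four source terms.
pattern = re.compile(r"a seconda d.{1,4}?")
for phrase in ["a seconda di", "a seconda dei", "a seconda degli", "a seconda delle"]:
    print(phrase, "->", bool(pattern.match(phrase)))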



Michael: He doesn't yet know that terminology in CafeTran can be handled in two ways:


Do not underestimate Mario. He's not your usual victim, a newbie.

H.

Michael: explain the basics of CafeTran terminology before you do so.


The basics are there; Kmitowski and Dimitriopolos took care of that. I wish I could write a "Why" rather than another "How to". And I'm serious, for once.

H.

IK: The following regular expression should catch all your examples


I have no doubt it will. I don't know any Italian, not after say 400 AD, but are you perfectly sure your regex won't also catch four-letter words (and I'm fond of them), or shorter ones, that shouldn't be caught?


H.

@woorden: Regarding how little you trust the use of regexes in term matching, have you ever stopped to consider that you are implicitly placing a lot of trust in the fuzzy matching algorithm used by CafeTran when matching terms in its TMXs? After all, what are regexes but yet another algorithm?


Fuzziness can sometimes mean errors (as you well know), yet you are constantly singing the praises of fuzziness, but only if used with TMXs.





Michael

> are you perfectly sure your regex won't catch any four letter words


It will catch a maximum of four letters after "a seconda d". If the scope is too broad, the user can easily modify that regex based on the provided example.
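
For example, a stricter variant that accepts only the four endings Mario listed (again just a sketch; in the glossary entry it would be preceded by "|", and any further endings would have to be added to the alternation):

import re

# Stricter variant: only the four listed endings are accepted.
strict = re.compile(r"a seconda d(?:i|ei|egli|elle)\b")
for phrase in ["a seconda di", "a seconda dei", "a seconda degli", "a seconda delle"]:
    print(phrase, "->", bool(strict.match(phrase)))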

