regex in glossaries, not for newbies

For the first time, I am testing the pipe character and regex in glossaries.


I discovered that the pipe character works best for plural forms (in English, at least).

frien|d


For verbs, brackets normally help.

shar|[ei]

buil|[dt]
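If you want to double-check what such a character class covers outside CafeTran, a quick Python sketch helps. The patterns below are my own plain-regex approximations of the bracket idea (the trailing \w* is my addition so that longer forms are also caught); they are not CafeTran's actual matching code.

```python
import re

# Plain-regex approximations of the stem + [class] idea; the bracketed
# class stands for the letter that may follow the fixed stem.
patterns = {
    "shar|[ei]": r"\bshar[ei]\w*",
    "buil|[dt]": r"\bbuil[dt]\w*",
}

words = ["share", "shared", "sharing", "build", "built", "building"]

for entry, pattern in patterns.items():
    hits = [w for w in words if re.match(pattern, w)]
    print(f"{entry:12} -> {hits}")
```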


But it was a challenge to find this:

|\buse?d?i?\w+


Finds use, used, uses, using but not focused, focusing 


But even focused and focusing are highlighted if the segment contains another "use" (this was explained by Igor in another thread).


As for |\buse?d?i?\w+, I know it is not handy and a waste of time during translation. Or perhaps there is an easier one to find these forms?
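For what it's worth, the behaviour of the raw regex part can be checked quickly outside CafeTran with Python's re module (this says nothing about the leading glossary pipe or how CafeTran itself applies the entry). The sketch also tries one possible simpler alternative, just as a suggestion; because of the trailing \b it behaves slightly differently.

```python
import re

# The pattern from the glossary entry (without the leading pipe, which
# belongs to CafeTran's glossary syntax, not to the regex itself):
original = re.compile(r"\buse?d?i?\w+")

# A possible simpler alternative; the trailing \b means it will NOT
# catch longer words such as "user" or "useful", unlike the original.
alternative = re.compile(r"\bus(?:e|es|ed|ing)\b")

for word in ["use", "used", "uses", "using", "user", "focused", "focusing"]:
    print(f"{word:10} original: {bool(original.search(word))!s:5} "
          f"alternative: {bool(alternative.search(word))}")
```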


Fragment bases may be the solution. Created a recent video.

From Proz, "Russian virgin and derivatives" (15:12): The Russian for virgin is дева (dyeva). Other words based on this are девушка (dyevushka), a girl; девочка (dyevochka), a young girl, i.e. a child; and девица (dyevitsa), which is pretty much the opposite of a virgin: a woman of dubious reputation. The Russian for "call-girl" is телефонная девица. That would be дев|а.

Only one term match is displayed, why?


Sentence: "these signature knits were washed out"


Glossary entries:

1. washed [tab] yıkanmış

2. |\bwash?e?i?\w+ [tab] yıkamak

3. wash|[ei] [tab] yıkanmak


CafeTran normally displays alternative meanings, but not in this case. Is this a full vs. fuzzy match problem?

I think you shouldn't use tab-delimited glossaries at all, but regexes happen to be the exception to that rule. However, I'd recommend only using them in a separate glossary - for regexes only - with a low priority.


H.

Selçuk: Finds use, used, uses, using but not focused, focusing

 

True:


You should have left out the word boundary thing \b to find the other words:
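A quick way to see the difference outside CafeTran (a plain-regex check in Python, not CafeTran's own matching):

```python
import re

with_boundary = re.compile(r"\buse?d?i?\w+")
without_boundary = re.compile(r"use?d?i?\w+")

for word in ["using", "focused", "focusing"]:
    print(word,
          "with \\b:", bool(with_boundary.search(word)),
          "without \\b:", bool(without_boundary.search(word)))

# Without \b, the "use" hidden inside focused/focusing also matches.
```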

H.




Selçuk: Only one term match is displayed, why?


The Magnificent Regex cum Glossary Experts can't seem to be bothered with your questions, so the enemy will give it a try.


It may be because glossaries only allow for exact matches - washed - and both the stemming and regex variants of the glossaries introduce fuzziness. They still should work, but if there's an exact match present, there's no use for fuzzy matches.
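Just to make that idea concrete, here is a toy sketch in Python of the "exact match wins" logic as I picture it - not CafeTran's actual implementation:

```python
import re

# Hypothetical mini-lookup: entries are (source, target, is_regex).
entries = [
    ("washed", "yıkanmış", False),
    (r"\bwash?e?i?\w+", "yıkamak", True),
]

def lookup(word):
    exact = [t for s, t, rx in entries if not rx and s == word.lower()]
    if exact:                     # an exact hit suppresses the fuzzy/regex ones
        return exact
    return [t for s, t, rx in entries if rx and re.search(s, word)]

print(lookup("washed"))   # ['yıkanmış'] - the regex entry is never consulted
print(lookup("washing"))  # ['yıkamak']  - only the regex entry fires
```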


Interesting question.


H.

Ipse: They still should work, but if there's an exact match present, there's no use for fuzzy matches.


And it's yet another reason to keep glossaries that introduce fuzziness separate from the one(s) without, the main reason of course being their incompatibility with standards/exchangeability. Otherwise, the Magnificent Regex cum Glossary Experts will again have to ask our Benevolent Leader for help to separate them, like they did with those ridiculous synonyms/alternative translations. And in this case, that won't be easy. On the other hand, both the stemming and the regex glossaries contain a pipe character, so they will be easy to find and... delete. A waste of time.


H.


It was my intention not to find focused; that is why I used the \b boundary.


I will test again, creating a secondary regex-only glossary for such terms. Do I need regex in glossaries? Probably not. But I hope it is a good way to learn the basics.


Selçuk: Do I need regex in glossaries? Probably not.


I never used them in a glossary, but I can imagine you getting a (large) job where they would come in handy. I'm still waiting for one, unfortunately.


Selçuk: But I hope it is a good way to learn the basics.


It most certainly is. That's my problem with scripting: I try to learn something, and then they come up with an example that doesn't mean a thing to me, like pictures and RGB and things. The good thing about regexes is that they are about text.


H.

Everyone has his own prefs, but I always wonder whether single-word expressions like:


For verbs, brackets normally help.

shar|[ei]

buil|[dt]


really are worth adding to a glossary. They take quite some time to figure out and enter. Or am I mistaken, and can entries like these be added automatically or typed blindly, once you know them?


Anyway, personally I reserve regular expressions for combinations of words, mostly when I have to span numbers. Much like the sentence patterns, but with numbers.
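For example - the phrase and pattern below are made up, purely to show the number-spanning idea:

```python
import re

# Hypothetical number-spanning pattern: one entry covers the phrase
# whatever the number happens to be.
pattern = re.compile(r"\bevery \d+ minutes\b")

for text in ["every 5 minutes", "every 30 minutes", "every few minutes"]:
    print(f"{text!r:22} -> {'match' if pattern.search(text) else 'no match'}")
```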

Is this what you get too?



Is the regular expression:


|\bwash?e?i?\w+[TAB]yıkamak


really valid?
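For what it's worth, the regex part itself compiles fine in Python (I can't check here what CafeTran makes of the leading pipe), but note that it is fairly broad:

```python
import re

pattern = re.compile(r"\bwash?e?i?\w+")

for word in ["wash", "washed", "washes", "washing", "waste", "wasp"]:
    print(f"{word:8} -> {'match' if pattern.search(word) else 'no match'}")

# Besides the wash- forms it also matches words like waste and wasp,
# because h, e and i are all optional and \w+ only asks for one more letter.
```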

Also note that yıkanmaklarum is inserted twice into the Target segment pane, even though I have this set:




Note that Automatic insertion of matches is not ticked.

A lot of testing ;). You are looking for a generic rule, aren't you?



The first two entries are displayed. The third one (the regular expression) isn't. Perhaps because it's not correct. Anyway: always place the regular expressions at the end of the glossary, to avoid the other entries not matching when a regular expression contains an error.

