Statistics question

Hi,


Can someone please explain the difference between the "translated segments" and "translated target segments" statistics? Thanks.


Alyssa


Hi Alyssa,


I think translated segments refers to source statistics and translated target segments refers to target statistics.


So the first shows the source words you have translated and the second the number of target words your translation has for the already translated segments.
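To make the distinction concrete, here is a minimal Python sketch of the two counts as I understand them. It assumes that a segment counts as translated when its target is not empty and that words are simply split on whitespace; both are assumptions for illustration, and CafeTran's own counting rules may differ. The function name is made up.

```python
def word_counts(segments):
    """Return (translated segments, source words, target words).

    `segments` is a list of (source, target) pairs. Only segments with a
    non-empty target are counted (assumption, see lead-in).
    """
    translated = [(src, tgt) for src, tgt in segments if tgt.strip()]
    # Source words of the segments that already have a translation.
    source_words = sum(len(src.split()) for src, _ in translated)
    # Target words of those same segments.
    target_words = sum(len(tgt.split()) for _, tgt in translated)
    return len(translated), source_words, target_words


segments = [
    ("Open the file", "Ouvrir le fichier maintenant"),
    ("Close the window", ""),  # not translated yet
    ("Save", "Enregistrer le document"),
]
print(word_counts(segments))
```

The two word totals differ as soon as the translation is longer or shorter than the source, which is why the source-based and target-based rows do not match.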


Jean


Would it be clearer if the label 'Translated segments' were changed to 'Translated source segments'?

Yes, definitely! I would expect translated source segments to be the highlighted statistic, since most of us get paid based on source word count (or am I wrong about this?)

No, you aren't wrong. On the other hand, I don't think the nice purple is meant as a highlighting colour. I think it's there to set that row apart from all the source-related rows.


Of course a great mind can come up with a solution here. Let us just wait until The Michael wakes up. If at all.

I swear the "speed in this session" etc. statistics at the bottom used to reflect source words. Is there a way to change the statistics display so everything is for source words? It doesn't help me to see all this for target words and makes me think I'm earning more money than I am, lol.

Hi to all,

Hi Igor,


First, I admit, I love statistics, and especially the CT statistics function.

It is very complete, even though I admit it took me some time to understand it. I am still puzzled, which is why I am writing.


Since I translate a lot of software, I love the indication in the lower right corner:

image


Great! This way I am never too long for a field description.



Then I still do not understand, even after reading (I guess) all the posts about statistics, why one would get this difference:

image


Here are the corresponding details:

image

So obviously: 59% is the 'Translated source segment' indication.

But the 73% puzzles me ... nothing matches ...


So does someone understand what the 73% of the right bar refers to?


Finally I would like to understand the intention behind:

a) Project->Statistics->Memory statistics

versus

b) Project->Statistics->Total statistics


I always get the same result: (always labelled 'Memory Statistics')

image


My question is simple:

Is 'Memory statistics' meant to be used before starting the job?

Maybe to adapt the match percentages according to the results?


Kind regards to all,

Thomas

Hello to all,


Follow-up to my last post on statistics.

I wonder:

1) Would anyone be interested in having combined statistics?

When I start a job I have my TM and my glossary.

I would like to know how many glossary entries match my source text.

This would give me a very good feeling for the time I need to allocate.

If I get a text with a very low number of 'hits/occurrences' in my glossary, I know that I will have to do a lot of research.


So I suggest:

Project->Statistics->Project statistics

including document statistics + Glossary statistics with a new column 'found glossary entries'

or

just a separate Glossary statistic with a single value: 'found glossary entries'. If multiple glossaries are open/assigned to the project, the value should be indicated per glossary.

This way I see: ah, glossary A is useless. I can close it again and so on.
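To show what I mean, here is a rough Python sketch of such a 'found glossary entries' value per glossary. It is only an illustration of the idea, not something CafeTran offers: the function name and the sample glossaries are invented, and the matching is a plain case-insensitive substring search.

```python
def glossary_hits(source_text, glossaries):
    """Count, per glossary, how many source terms occur in the source text.

    `glossaries` maps a glossary name to a list of source-language terms.
    Matching is a simple case-insensitive substring test (assumption).
    """
    haystack = source_text.lower()
    return {
        name: sum(1 for term in terms if term.lower() in haystack)
        for name, terms in glossaries.items()
    }


source_text = "Die 1.Tournummer des Tages wird am Schalter angezeigt."
glossaries = {
    "A": ["Bilanz", "Umsatz"],
    "B": ["1.Tournummer des Tages", "Schalter", "Fahrplan"],
}
print(glossary_hits(source_text, glossaries))
```

A glossary with zero hits, like A in this sample, is the one I would close again.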


What do you think?


Kindest regards,

Thomas



Hello Thomas,


The first progress bar shows the percentage of already translated content (number of characters?), the second (optional one) shows the progress in terms of segments.


Let’s say you have a project with 100 segments.


If you go to the 31st segment without translating anything, the first progress bar will be 0% and the second 31%.
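The same example as a small Python sketch. It assumes the first bar is computed from the share of translated segments (it may well be based on characters instead, as noted above) and the second from the position of the current segment. The function name is made up for illustration.

```python
def progress_bars(translated_flags, current_segment):
    """Return (translated %, position %) for a project.

    `translated_flags` holds one boolean per segment (True = translated).
    `current_segment` is the 1-based number of the segment you are in.
    """
    total = len(translated_flags)
    if total == 0:
        return 0.0, 0.0
    translated_pct = 100 * sum(translated_flags) / total
    position_pct = 100 * current_segment / total
    return translated_pct, position_pct


# 100 segments, nothing translated, cursor in the 31st segment.
print(progress_bars([False] * 100, 31))
```

Since the two values are computed from different things, they only agree when you translate every segment in order without skipping any.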


Both progress bars are useful. If the project is already translated (progress bar at 100%), you can still have a good idea of your review progress, etc.


---


Memory statistics are obviously mostly interesting before starting a project, at the analysis phase.


---


I don’t see the point of integrating glossaries into the statistics.


Example: Take the IATE glossary. Every few words, it gives you matches. How many of these are relevant for your project?


Translators don’t translate words, but their meaning in context, and the corresponding word choices can vary greatly. That is why having the number of word/term matches in the statistics would only be of marginal use. Do you think a glossary could be helpful for your project? Just add it (in read-only mode, for example), join the glossary tabs together and that’s it.


The Statistics function has its limits when it comes to estimating how long a project might take you. You still need to review the files. This is the best analysis, and it can only be complemented by the CAT tool statistics function.


I’d say that when you first read and analyse a project (especially a big one), it can still be interesting to make a different kind of analysis: monolingual term extraction.


Monolingual term extraction attempts to analyze a text or corpus in order to identify candidate terms.


Unless you want to use the Task > Frequent words solution, you may want to take a look at this list for some external tools suggestions (depending on your source language): https://github.com/idimitriadis0/TranslateOnLinux/wiki/TranslateOnLinux#term-extraction


Jean

PS: You can also import a glossary as a Memory (Memory > Import > Import tab delimited memory, etc.), but that would only be useful for very short segments (UI options, etc.) in terms of statistics.

>Finally I would like to understand the intention behind:

>a) Project->Statistics->Memory statistics

>versus

>b) Project->Statistics->Total statistics


If you have a few TMs open, Statistics->Memory statistics shows you the statistics for each TM separately. Total statistics calculates the matches for both the project and the TMs in one go.

Hello Jean, 


Thanks for your input. I understand what you mean, but do not agree. 

I am a translator and interpreter. 


A) The interpreter

I have a glossary per conference. I reuse the same glossary the next time I have the 'same' conference again, a year later or whenever.

This glossary has the same format as my main translator glossary, and I always feed it into my translator glossary.


B) The translator

I have a main glossary, but very often I do compile glossaries before starting a job. I do that according to CT word statistics, when I see there that the text contains technical terms.

I also have special glossaries based on a specific ISO standard and the like.


C) Under Preferences->Glossary I use the function 'Display longest match only' 

This, in my humble opinion, avoids useless word collections.


I agree, IATE is not helpful for that. I also use it offline all the time (on trains etc.)


Example of an entry:

#de                                            #fr

1.Tournummer des Tages        1er numéro de course de la journée


So now you understand that my glossary entries are not 'normal' word lists and why I use: 'Display longest match only'.


I know that I can turn a Glossary into a Memory but do not see the added value of it.


Kindest regards, 

Thomas

Hello Thomas,


Thank you for taking the time to explain your use case and clarify your point. It was a good read, to say the least. I now see why implementing matches at the word level could be useful in Statistics.


Thomas: I have a main glossary, but very often I do compile glossaries before starting a job. I do that according to CT word statistics.


Interesting. Would you care to expand on that? And do you mean Task > Frequent words, or something else? I have tried this when analyzing a project, in order to get a better grasp of its technicality and better orient my terminology research, but have found it only marginally useful in CTE. I now prefer to use external tools, such as those listed in the link of my previous post. English being my main source language, this leaves me with many options.

Jean



Hello Jean, 


Many thanks for your input, with which I partially agree.

First of all I have to say I have been on Mac for 15 years. I moved away from Microsoft and Trados because I was sick and tired of the recurrent issues in Trados. MultiTerm was IMHO the best tool. I discovered Linux through Apple.

Back then Linux had no real alternative to offer translators. Plus, formats were quite problematic back then. So I picked up Heartsome and have since then only used software that is cross-platform in essence: LibreOffice / NeoOffice and, yes, Office for Mac to ensure compatibility.


I agree, Task > Frequent words produces lots of very useless stuff, but also some useful results. This is why I believe in the power of the translator checking the document.

I do not believe in online solutions because I travel a lot as an interpreter. Online solutions do not work in trains, planes and remote sites/hotels. So the function Task > Frequent words, combined with the logic 'look up the glossary entries you find in the source text', would actually be very useful. In my example

#de                                            #fr

1.Tournummer des Tages        1er numéro de course de la journée

CT should ONLY look up '1.Tournummer des Tages' NOT '1', nor 'Tournummer', nor 'des' nor 'Tages'

That to me is 'longest match only'-logic.

Actually CT should NEVER look up single words. That is utterly useless. 
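To show the logic I have in mind, here is a small Python sketch of a 'longest match only' lookup. It is my own illustration, not how CT works internally: the function name is invented, and it uses a plain case-insensitive substring search, so it ignores word boundaries.

```python
def longest_matches(text, terms):
    """Return glossary terms found in `text`, preferring the longest match.

    A shorter term is ignored wherever it overlaps a longer term that has
    already been matched. Results are returned in order of appearance.
    """
    lowered = text.lower()
    taken = []  # (start, end) spans already claimed by longer terms
    found = []  # (start, term) pairs for accepted matches
    unique_terms = [t for t in dict.fromkeys(terms) if t]
    for term in sorted(unique_terms, key=len, reverse=True):
        needle = term.lower()
        start = lowered.find(needle)
        while start != -1:
            end = start + len(needle)
            # Accept only if this span does not overlap a claimed span.
            if all(end <= s or start >= e for s, e in taken):
                taken.append((start, end))
                found.append((start, term))
            start = lowered.find(needle, start + 1)
    return [term for _, term in sorted(found)]


text = "Die 1.Tournummer des Tages ist 12; Tournummer 7 folgt."
terms = ["Tournummer", "1.Tournummer des Tages", "Tages"]
print(longest_matches(text, terms))
```

In the sample, 'Tournummer' and 'Tages' are not reported inside '1.Tournummer des Tages'. 'Tournummer' is only reported where it stands on its own.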


Finally, thanks for your references.

Today Linux rocks!! Much better than back then. But Mac OS has lots of Linux in it ... :-)


Kindest regards

Thomas


Regarding Frequent words, I don't know if you are aware of this, but a few months ago, Igor added the ability to set the minimal and maximal fragment length.


You can define the minimum and maximum number of characters for each Frequent words entry displayed by CafeTran.


In word-based languages, setting the minimum and maximum number of words (not characters) would make more sense, but this can still help limit excess noise (too long or too short repeated fragments).
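To illustrate how a character-length filter cuts down the noise, here is a minimal Python sketch. It is not CafeTran's algorithm, just a simple count of repeated word sequences (up to three words) whose length in characters falls between a minimum and a maximum. All names and default values are assumptions.

```python
from collections import Counter


def frequent_fragments(text, min_chars, max_chars, max_words=3, min_count=2):
    """Return repeated fragments as (fragment, count), most frequent first.

    Fragments are sequences of 1..max_words whitespace-separated words.
    Only fragments with min_chars <= length <= max_chars are counted.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            fragment = " ".join(words[i:i + n])
            if min_chars <= len(fragment) <= max_chars:
                counts[fragment] += 1
    return [(f, c) for f, c in counts.most_common() if c >= min_count]


sample = ("open the file menu then open the file dialog "
          "and close the file menu")
for fragment, count in frequent_fragments(sample, min_chars=8, max_chars=13):
    print(count, fragment)
```

With a minimum of 8 characters, short single words such as 'the' or 'file' no longer show up, while repeated fragments such as 'the file menu' remain.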


To access this option, just right click inside the Frequent words tab.


I've included this information in the Menu and Interface reference document, see Task > Frequent words context menu.


---


Since you are looking for offline solutions, in addition to the Frequent words in CTE, you could test the "Term extraction" feature inside the Rainbow application of the Okapi Framework. Like CafeTran, it uses purely statistical analysis, but provides richer options that could prove useful to you.


Rainbow is a free-libre/open source cross-platform (Java) application.


Here's a possible workflow:


- Create the project in CafeTran and export a TXT file (Project Export and exchange > To TXT), then drag and drop the resulting file into the Input list 1 tab in Rainbow. Alternatively, just drag and drop the source file into the Input list 1 tab, since Rainbow can handle a multitude of file formats.

- In Rainbow, go to Utilities > Term extraction

- Select the options that you wish to use and press Execute.

- Open the resulting terms.txt. Since it's tab delimited, you can also rename it as .csv and open it in a spreadsheet program.

- (optional) If you like this feature, you can save your specific preferences and reuse them by creating a custom "pipeline" (let me know if interested).
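If your spreadsheet program does not pick up the tab delimiter after a simple rename, a small Python script can convert the file into a real comma-separated one. This is only a sketch: it makes no assumption about which columns terms.txt contains, it assumes the file is UTF-8 encoded, and the function names are made up.

```python
import csv
import io


def tsv_to_csv_text(tsv_text):
    """Convert tab-delimited text to comma-separated text."""
    # QUOTE_NONE: treat quote characters in the input as ordinary text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    # The writer quotes fields that contain commas or quotes.
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


def convert_file(src_path, dst_path):
    """Read a tab-delimited file and write it as CSV (UTF-8 assumed)."""
    with open(src_path, encoding="utf-8") as src:
        converted = tsv_to_csv_text(src.read())
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(converted)


print(tsv_to_csv_text("3\tprogress bar\n2\tterm, with comma\n"))
```

For the file from the workflow above, the call would be convert_file("terms.txt", "terms.csv").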


Attaching a screenshot of the Term extraction options in Rainbow.


image
