
REQ: Enhancing Frequent words feature (term extraction)

CafeTran’s “Frequent words” feature (Task > Frequent words) is pretty handy, especially given the ability to quickly search resources (including MT engines) for translations, and to easily add terms/fragments to a glossary or fragment-enabled TM.


Being able to conduct such instant frequency searches on a batch of imported files and on any file format handled by CafeTran is a big plus, compared to external term extractors which usually support just a handful of file types.


While there are several (paid or free) specialized external term extractors, some of which use linguistic, syntactical and morphological algorithms on top of the statistical analysis, running an integrated statistical search on the imported text can be enough for many translators’ needs.


However, I feel this feature could be enhanced to extend its usefulness.


Here’s a proposal in that direction:


- The most important addition would be better handling of noise. For example, recurring empty/function words such as “and” or “with” could be removed from the list and from the subsequent suggested term candidates. This could take the form of a “stop word” list, a simple text file (just like nontranslatables.txt, shortcuts.txt or abbreviations.txt) listing words to be ignored in the word frequency analysis. In addition to the list, an “Ignore” button (or “Noise” button) could be added beside the “Add” and “Search” ones, allowing the user to quickly add a word or words to the Ignore list. A similar action (Add to ignore list) could be considered for the Resources or Task > Frequent words menu.


- In the current implementation, the strings of words included in the word frequency results have no word limit: whole segments (and subsegment variations) appear in the Frequent words list. While this may serve a purpose, since Frequent words is not only meant to spot terminology or phraseology candidates, in general it only adds more noise, because terms are usually not that long. Therefore, I suggest making the maximum and minimum number of words per term user-definable in the Preferences (with 0 or no entry lifting the limit). For instance, the maximum number of words per term could default to 3, and the minimum to 1. I’m thinking of a subsection in the General or the Workflow pane.


- A last option could be a checkbox for applying Match case to the frequency results.
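A minimal sketch of how the three options above (stop word list, minimum/maximum words per term, Match case) could fit together, written in Python. It is not how CafeTran implements Frequent words; the function name `frequent_terms`, its parameters, and the choice to reject only candidates that begin or end with a stop word are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical helper for illustration only; not CafeTran's actual code.
# A "word" is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

def frequent_terms(text, stop_words=(), min_words=1, max_words=3,
                   match_case=False, min_count=2):
    """Count word n-grams and return (term, count) pairs, most frequent first."""
    stop = {w.lower() for w in stop_words}
    if not match_case:
        text = text.lower()
    counts = Counter()
    # Split on sentence-like punctuation so candidates never span two clauses.
    for chunk in re.split(r"[.;:!?\n]+", text):
        words = WORD.findall(chunk)
        low = max(min_words, 1)
        # A maximum of 0 lifts the limit, as suggested in the proposal.
        high = max_words if max_words else len(words)
        for n in range(low, high + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Assumption: reject candidates that start or end with a stop
                # word, but allow one in the middle ("salt and pepper").
                if gram[0].lower() in stop or gram[-1].lower() in stop:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]

sample = "Olive oil and vinegar. Olive oil with salt. Vinegar and salt."
for term, count in frequent_terms(sample, stop_words=["and", "with"], max_words=2):
    print(count, term)
```

Allowing stop words inside a candidate keeps phraseology fragments usable while still removing bare “and” or “with” entries; a stricter filter would reject any candidate containing a stop word.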


I think the above additions could really make a difference for anyone wishing to conduct integrated terminology/phraseology searches in their projects.


This could help make the “Frequent words” task a valuable step in the already rich translation workflow CafeTran is able to offer.


To me, the advantages far outweigh the "complexity" introduced by the additional suggested items in the Preferences and/or the Menu. If I were not convinced of this and of the enhanced usefulness, I would not suggest such additions.


What do you think?


Jean


Jean: This could help make “Frequent words” task a valuable step in the already rich translation workflow CafeTran is able to offer.

....What do you think?


You asked for it. I think the Frequent Words feature is not a part of CafeTran's core actions, and an enhanced FW feature would be even less so.


It's not useful for the everyday jobs of freelance translators, and should be left to linguistically educated specialists using specialised software.


A waste of time and resources of both Igor and us. Unfortunately, if I claim it, His Igorness will almost certainly implement it.





H.

I agree this does not fall under the core actions.


It is an optional step which makes sense in some translation workflows.


A statistical (frequency) analysis has the advantage of being applicable to any language.


If CafeTran offers this feature, why not improve its usability? In its present implementation, I tend not to use it because of the excessive noise produced, although I see it has real potential.

Jean: If CafeTran offers this feature, why not improve its usability?


Because CafeTran shouldn't provide it at all. I use Antconc, "free" and multiplatform. It's easy, and can do all the things you want it to do. You may want to use CafeTran to convert the document in a Project to .txt, but as a freelance translator, you'll have to load the document anyway.


> It is an optional step which makes sense in some translation workflows.


I doubt it. I can see a use for agencies, but we are freelance translators, and thank heavens, CafeTran is there for freelance translators.


H.


> I think the Frequent Words feature is not a part of CafeTran's core actions


I do not agree.


> It is an optional step which makes sense in some translation workflows.

 

Seconded. Handling noise is indeed one important point, but the apostrophe bug (already mentioned here, see below; for the not-frog-eating community: it should be „de l’entreprise“) should be resolved, too.


image


Another good idea would be to have exact glossary or TM matches (optionally) displayed in another column on the right.
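As a rough illustration of that extra column, here is a small Python sketch. The data shapes are assumptions: the frequent-word list is taken to be (term, count) pairs and the glossary a plain mapping from source term to a list of translations, neither of which reflects CafeTran's internal structures.

```python
def with_glossary_matches(frequent, glossary):
    """Add a third column holding exact (case-insensitive) glossary matches."""
    # Hypothetical data shapes: frequent = [(term, count)], glossary = {source: [targets]}.
    index = {source.lower(): targets for source, targets in glossary.items()}
    return [
        (term, count, "; ".join(index.get(term.lower(), [])))
        for term, count in frequent
    ]

rows = with_glossary_matches(
    [("olive oil", 2), ("salt", 2)],
    {"Olive oil": ["huile d'olive"]},
)
for term, count, match in rows:
    print(f"{count}\t{term}\t{match}")
```

Terms without a glossary entry simply get an empty third column, which makes the remaining candidates easy to spot.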

Jean: This could take the form of a “stop word” list, a simple text file (just like nontranslatables.txt, shortcuts.txt or abbreviations.txt) listing words to be ignored in the word frequency analysis.


I think you can already achieve this in CafeTran. Try this:

  1. Download a list with stopwords
  2. Open it in column A in a spreadsheet
  3. In B2, you enter "blabla" (or whatever)
  4. Fill down column B
  5. Import the spreadsheet in CafeTran in a new glossary
  6. Enable "Show Only Unknown Words" in the Task menu (for Frequent words)
  7. Run Frequent Words, the terms in the list with stopwords will be ignored
  8. Close the tab with the Frequent Words in the tabbed pane
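Steps 1 to 4 (the spreadsheet part) could also be scripted. The Python sketch below turns a plain stopword list into two tab-separated columns, the stopword and a dummy "translation". It assumes the downloaded list has one word per line and that a two-column tab-separated file can be imported as a glossary; whether CafeTran expects a header row is not covered here. The function name is hypothetical.

```python
def stopwords_to_glossary(stopword_text, dummy="blabla"):
    """Return tab-separated 'stopword<TAB>dummy' lines, one per unique stopword."""
    rows = []
    seen = set()
    for line in stopword_text.splitlines():
        word = line.strip()
        # Skip blank lines and duplicates, keep the original order.
        if word and word not in seen:
            seen.add(word)
            rows.append(f"{word}\t{dummy}")
    return "\n".join(rows) + "\n" if rows else ""

print(stopwords_to_glossary("and\nwith\nof\n"), end="")
```

To use it on a real list, read the downloaded file as UTF-8 text, pass its content to the function, and save the result as a .txt file for import in step 5.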

Et voilà, or whatever you frog-eaters say (quite nice frog legs, the way they get "harvested" is a bit less nice)

H.

Good catch, Hans.


However, without setting minimum/maximum words per term/fragment, I'm afraid a "stop list" won't be enough. It's the combination of these two features that would really help tackle excessive noise, in my view.


If a max/min word limit per term is set in the Preferences, implementing a stop list per locale (as is the case for abbreviations) or a selectable one in the Preferences (as for nontranslatables) would be more convenient than this workaround (which also probably requires removing the glossary when done, to avoid interference with other resource suggestions).

Jean: ...without setting minimum/maximum words per term/fragment


I never used CafeTran's FW extraction, but assuming you can save the list, it should be perfectly easy to filter it for entries of n or more words, and then delete those. However, that would also delete very useful fragments of more than n-1 words.
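A sketch of that filtering step in Python, under the same assumption that the list can be saved, here as one entry per line. The function name and the `max_words` parameter are made up for the example. Returning the dropped entries as well makes it possible to review them before deleting, since some long fragments may still be useful.

```python
def split_by_length(entries, max_words=3):
    """Separate entries of at most max_words words from longer ones."""
    kept, dropped = [], []
    for entry in entries:
        target = kept if len(entry.split()) <= max_words else dropped
        target.append(entry)
    return kept, dropped

saved_list = [
    "olive oil",
    "extra virgin olive oil",
    "salt",
    "terms and conditions of sale",
]
kept, dropped = split_by_length(saved_list, max_words=3)
print("kept:", kept)
print("dropped:", dropped)
```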


> ...requires removing the glossary when done


Yes, see my point 8.


But I'm still wondering what the use of FW would be for mere translators like me, and what makes you so desperate to have it included in CafeTran. Examples, please.


H.

@Woorden

That was a good catch with your workaround, as it shows that all the needed elements are already there. I hope this is a good starting point for Igor.

Concerning the other question: Yes, the FW feature is important for preparing bigger translations. It gives you (sometimes) a better idea of the text and helps to populate the glossary (and pimp Auto-Complete a bit). The FW feature is useless for smaller texts, for rush jobs, and mostly for users who neglect glossaries in favor of TMs.

In the sense mentioned above, the (optional) display of exact glossary or TM matches could help to add alternative terms to existing entries.

 

Unless I'm missing something, the glossary you describe will need to be removed when done, not because of FW but because it would interfere with term suggestions in the project.


n-1 words: this is why it can be useful to be able to change/adjust the max/min word settings.


For instance, I tested FW on a project featuring various products and their ingredients. Since the products were similar, several ingredient lists were similar as well. This produced many FW entries in which the very long ingredient lists were combined in every possible way at the frequency level. Together with empty words such as “and”, “of” and “with” (which may work well in phraseology fragments, but not so much for spotting term candidates), this produced a level of noise that made the FW feature unusable.
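To give an idea of the scale, a repeated list of k words contains k(k+1)/2 contiguous fragments if every sub-span is counted as a candidate. Whether CafeTran really counts every sub-span is an assumption here, so the figures below are an upper bound rather than a description of its behaviour. The Python sketch enumerates the fragments and shows how a maximum of 3 words per term would shrink the count; both function names are hypothetical.

```python
def contiguous_fragments(words):
    """List every contiguous run of words (all lengths, all start positions)."""
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]

def fragment_count(k, max_words=None):
    """Number of contiguous fragments in a k-word list, optionally capped in length."""
    if max_words is None or max_words >= k:
        return k * (k + 1) // 2
    # Sum of (k - n + 1) for n = 1..max_words.
    return max_words * (2 * k - max_words + 1) // 2

ingredients = "water sugar salt vinegar".split()
print(len(contiguous_fragments(ingredients)), "fragments from a 4-word list")
print(fragment_count(12), "fragments from a 12-word list without a limit")
print(fragment_count(12, max_words=3), "fragments with at most 3 words each")
```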


To be clear, I am not desperate to have FW included in CafeTran. It already is present. There are external tools that tackle this, but improving the existing integrated solution makes sense.


Translators may not use a term extraction feature if it is impractical or if they need to do this outside their chosen CAT tool (I have seen a translation study confirming the low use of term extraction in general). But they may start to take advantage of such a feature if they have it just at their fingertips.


This is all my proposal is about: CONVENIENCE.


By making a feature convenient to use (and the “Search” and “Translate” FW buttons are exactly in that direction), you actually give it its full meaning.


To take an example, most Apple products or product features were not revolutionary in the technology department (the same or similar technology or products were already out there), but in the usability department: they made the same feature or product type actually easy, convenient, and enjoyable to use (another “revolutionary” thing is their marketing strategy, but this is another discussion).

woorden: Examples, please.


In practice, I think most translators tend to perform ad hoc terminology and documentary-related research, while working on their texts and segments, when and as needed.


This works well on short assignments or for short deadlines. FW’s “Extract frequent words from current fragment” can help in this scenario.


This is not the only way to work.


Do you perform a preliminary reading of the text you are about to translate? Any preliminary documentary or terminology-related research? Do you analyze the text first, to get a better grasp of the interpretation and translation difficulties, or to estimate its level of technicality?


So here’s another method, especially relevant for longer texts and assignments, when you can really take the time to proceed as you wish.


You explore the text BEFORE starting to translate it, to understand it as a whole (since we don’t translate words but their meaning in context, and we don’t translate unrelated segments and phrases, but a coherent text or at least segments that take on their full meaning within their given context). In that process, you can choose to perform preliminary documentary research, or address some recurring terminology issues. Good documentary research in the target language often helps solve terminology issues as well. This research is not abstracted from the text; it is fully informed by this first reading. This preliminary research can help lay the groundwork, which can make the rest of the work easier, more efficient and more consistent. With interpretation and terminology issues partly out of the way (not entirely, since additional on-the-spot research is always needed), you can focus on exploring your target language resources in order to best reformulate the original text.


In this context, having a working, integrated, convenient Frequent words/term extraction feature can help perform this analysis and groundwork. It can be seen as an integral part of such an approach.


Other examples:


- You need to make batch deliveries. To achieve consistency in translation choices, preliminary work and analysis on the full project are highly recommended.

- You need to quickly assess a project before accepting a job. You skim through the content, but can also perform such a statistical analysis to have a better idea of the recurring terms.

- You work on a team project (and CafeTran also caters for that). Establishing a consistent terminology which is informed by the frequent occurrence of certain terms can help build a glossary to ensure better consistency across the team.

- CafeTran is more geared towards individual translators, but it can be used in an agency or a larger organization context, where the above “team work” example is even more frequent or relevant.

- Even mere translators are sometimes requested to deliver or update a glossary along with their translation. At least I have been requested to do that on a number of occasions.


J.

Jean: To be clear, I am not desperate to have FW included in CafeTran.


Thank you for your kind cooperation.


H.

Jean: In practice, I think most translators tend to perform ad hoc terminology and documentary-related research, while working on their texts and segments, when and as needed.


And for a good reason. A practical reason. CONTEXT. Context is everything, as you know, and FW doesn't provide it.


When I tried AntConc, I had lots of fun, but I didn't see a single use for my work.


> In this context, having a working, integrated, convenient Frequent words/term extraction feature can help perform this analysis and groundwork.


In AntConc, I first tried the stopwords file to reduce what you call the noise. And it reduces it by some 30-40%. In a larger (but not necessarily LARGE) file, you still end up with, say, 6,500 contextless words out of a 10,000-word file. I tried to reduce that further by also deducting the 2,000 most used words. I don't remember how much that reduced the word count, but those words could very well be part of a fragment that's very relevant and specific. Anyway, using FW is neither efficient nor helpful without context.


> You need to make batch deliveries.


That's why you maintain a TM for Fragments (glossary for the challenged) of the Project.


> ...perform such a statistical analysis to have a better idea of the recurring terms


I don't think there's a relation between the frequency of recurring terms and their importance for the project. Not at all. The most frequently used terms in that project will probably be the words/terms in the, err, list of frequent terms of a language.


> You work on a team project


Again: That's why you maintain a TM for Fragments (glossary for the challenged) of the Project.


> ...but it can be used in an agency or a larger organization context


I specifically mentioned "freelance translators." And I wonder if it's of any use for agencies (I had a project like that a week or two ago). No context.


> Even mere translators are sometimes requested to deliver or update a glossary along with their translation.


See above (for the 3rd time).


H.




In my primary example, I’ve placed term extraction within an approach in which the translator reads the text and so makes context-aware decisions and searches regarding terminology and documentary research. Not out of context.


Improving FW will also add to the usefulness of “Extract frequent words from current fragment.”


For batch projects, I see your point but I beg to differ in one respect: of course TMs (for segments and fragments) will help consistency, but you can’t change the translations already delivered. You need to take a global approach to the project, to limit the instances where you wish to go back and change the chosen translation, etc. FW can help in that respect. CafeTran is excellent for concordance searches, so the words that come up in the FW results don’t need to stay "out of context."


In general, as I said in the first post, a complete term extraction solution would require the use of linguistic, syntactical and morphological algorithms on top of the statistical analysis.


This is outside CafeTran’s scope.


A statistical/frequency analysis can be applied to any language, which is a big advantage.


The issue with term extraction based only on word frequency analysis is the opposite of "noise": "silence". You are right that some important terms won’t be repeated or frequent in a text; catching them requires a different approach, as described above.


If CafeTran can tackle the "noise" issue, I think the "silence" issue can be a fair trade off.


In that case, FW can be a viable solution and useful option.
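The trade-off can be put into numbers. In the usual information-retrieval sense, noise is the share of extracted candidates that are not real terms, and silence is the share of real terms the extraction missed. The Python sketch below computes both for a made-up candidate list against a made-up reference glossary; the function name and the sample data are purely illustrative.

```python
def noise_and_silence(candidates, reference_terms):
    """Return (noise, silence) as fractions between 0 and 1."""
    candidates, reference = set(candidates), set(reference_terms)
    hits = candidates & reference
    # Noise: extracted candidates that are not in the reference list.
    noise = 1 - len(hits) / len(candidates) if candidates else 0.0
    # Silence: reference terms that the extraction did not find.
    silence = 1 - len(hits) / len(reference) if reference else 0.0
    return noise, silence

extracted = ["olive oil", "and", "with", "salt"]
wanted = ["olive oil", "salt", "sodium benzoate"]
noise, silence = noise_and_silence(extracted, wanted)
print(f"noise: {noise:.0%}, silence: {silence:.0%}")
```

A stop word list and a word limit per term act on the first figure only; a term that occurs once in the text stays silent whatever the filters.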


I only reluctantly mention third-party tools, because the point is not to make a comparison, my only objective being to help improve CafeTran:

  

AntConc: can you set the min/max words per term in this tool? I did not find it at a glance. I only use it occasionally for concordance, it’s very good at that. Okapi’s Rainbow term extraction offers this if you want to try it (both stop words and min/max words per term): http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm#Options_Lists


I hope Igor chimes in to comment on the original suggestion.


Again, no big drama if this is not implemented, it’s more a "nice to have" than a "must have." But still, just picture it :-)

I'm not going to answer all your points, Jean, but:


> If CafeTran can tackle the "noise" issue, I think the "silence" issue can be a fair trade off.


Nobody can tackle that, and that's my main argument against it. It's simply too much work, and nobody will use FW more than once.


> AntConc: can you set the min/max words per term in this tool?


Yes, but do you realise n-grams will multiply the noise?


> I hope Igor chimes in to comment on the original suggestion.


So do I.


H.
