Any day now, Slate Desktop will be released.
Then the big battle can be held. Who will win?
Slate Desktop's automatic statistical techniques, or my own manual statistical techniques?
My fragments list should reach 1 million lines soon. Every line has been carefully collected and optimised over the past years.
And a second question will be: which results will be better?
You are the perfect test case, since you do only one kind of text and have collected masses of nearly identical bilingual data (very much unlike myself, sadly). I'm curious what you will find!
Thanks for taking the time to respond here!
>Slate Desktop's true benefit is the reduction in skill level necessary to create new engines for a variety of uses.
Yes, I can see that. However, in my agency I constantly get new subjects. The biggest part of the work is to coin the terminology. This is a very time-consuming process, and I often have to revise my terminology while working through the translation.
This is manual (and intellectual) labour, and it will always stay that way. There's no way that SD or any other MT system will be able to do this work for me.
So, I'll have to do it anyway. This coining, I mean. And I also make the snippets as large as possible, to improve the auto-assembling result. Here I could probably save some time and let SD identify the snippets.
However, how on earth can SD know the genus, casus etc. of the nouns, in order to correctly identify snippets? It can't. It isn't rule-based. And when I start working on a new subject, I don't have a big corpus that could be used to train SD and let it identify snippets.
Not to speak of the fun part in translation.
At least, that's how I see it, with my very peculiar mindset and unique hardwiring (not appreciated by less tolerant persons: they really get extremely mad about my autonomous thinking; see the many kind replies that Hans van de Broek, who obviously has a different workflow, which is fine by me, though the other way around obviously isn't, has posted here whenever I have the evil courage to post some new ideas here or elsewhere).
This coining and constant puzzling, the re-coining and continual optimising of my snippets/terms, the constant improving of myself: that is the big fun of translating for me.
Learn new things, constantly change my modus operandi, re-invent workflows, improve. Ignore silly people. Don't let them irritate me. That's the hard part.
But dumb people with rusty patterns, unable to renew themselves, should not keep me from improving myself and my work.
Well now, back to SD: even if it could learn the coining, it wouldn't be good for me: it would probably take away much of my professional fun.
I'm not sure. Let me test the software later this year.
Thanks for sharing your processes and personal insights into “who is Hans van de Broek.” We make no claims that SD can satisfy every business process. You say the coining “is manual... and it will always stay that way.” I tend to agree with you.
You ask, “how on earth can SD know the genus, casus etc. of the nouns, in order to correctly identify snippets?” Simply put, it cannot. I started using new terminology to describe what it does do. SD (all SMT, really, but I'm “branding” here a little) is a concordance search engine. I was formulating the term as this group razzed Michael a few weeks ago about his work style of searching mega-concordances.
A traditional concordance has no part-of-speech or right-vs-wrong components. It’s a saved collection of the most frequent n-gram groups. How do they determine “most frequent”? At some point, someone counted everything. I propose that traditional concordances have been limited to the most frequent entries simply because the technical requirements to save, search and retrieve from an exhaustive collection are massive. Is it a coincidence that SMT has been criticized for requiring massive computing resources?
SD’s underlying SMT technology creates bilingual concordances consisting of co-occurring n-grams between the two languages. The training process counts the co-occurring n-grams, then calculates and saves their frequencies. The resulting “knowledge base” contains not only 3-grams or 4-grams... it contains every n-gram for n from 1 to X (typically 7).
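To make the counting concrete, here's a minimal Python sketch — my own toy illustration of the idea, not Slate's actual code — that builds such a frequency table of every n-gram from 1 to 7 over a tiny corpus:

```python
from collections import Counter

def ngrams(tokens, max_n=7):
    """Yield every n-gram of length 1..max_n from a token list."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

# Toy "knowledge base": count every n-gram in a tiny corpus.
corpus = [
    "i read the book yesterday",
    "i read the paper every day",
]
counts = Counter()
for sentence in corpus:
    counts.update(ngrams(sentence.split()))

print(counts[("i", "read")])            # 2
print(counts[("read", "the", "book")])  # 1
```

The real training process does this bilingually (counting co-occurring source/target n-grams), but the bookkeeping is the same: count everything, save the frequencies.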
The co-occurring groupings serve as context windows that enable SMT to make pseudo-“linguistic” distinctions. For example, when do you translate “read” as present vs simple past? The distinction comes from the concordance n-gram context surrounding the source token, not from the token itself. So, the token “yesterday” within 7 grams of “read” in English tips the balance, and the Dutch translation draft gets the simple past, not the present tense.
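Here's a toy sketch of that “yesterday tips the balance” idea (my own illustration, not SD's implementation; the Dutch forms “las”/“lees” and the cue words are assumptions chosen for the example):

```python
def choose_tense(tokens, verb="read", window=7):
    """Toy disambiguation: pick the Dutch simple past 'las' if a past-time
    cue appears within `window` tokens of the verb, else present 'lees'.
    Real SMT reaches the same effect implicitly via n-gram frequencies,
    with no explicit rules like these."""
    past_cues = {"yesterday", "ago", "last"}
    if verb not in tokens:
        return None
    i = tokens.index(verb)
    context = tokens[max(0, i - window): i + window + 1]
    return "las" if past_cues & set(context) else "lees"

print(choose_tense("i read the book yesterday".split()))   # las
print(choose_tense("i read the paper every day".split()))  # lees
```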
But that's only the front end of the engine, commonly called the “translation model.” The back end is the language model, or “style guide” as I call it. This is a concordance of all n-grams of the target language. Together, the front and back ends search for and select the most likely combination of tokens (based on frequency), using the concordance contexts rather than the simple words.
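The way the two models combine can be sketched as a classic noisy-channel scoring. The log-probabilities below are invented purely for illustration, not SD's real numbers:

```python
import math

# Invented translation-model scores: how often "read" co-occurs with each
# Dutch candidate in the bilingual concordance.
tm = {("read", "las"): math.log(0.4), ("read", "lees"): math.log(0.6)}
# Invented language-model scores: how often each candidate follows
# "gisteren" (yesterday) in the target-language concordance.
lm = {("gisteren", "las"): math.log(0.5), ("gisteren", "lees"): math.log(0.05)}

def score(src, tgt, prev_tgt):
    """Combine both models: log P(tgt | src) + log P(tgt | prev_tgt)."""
    return tm.get((src, tgt), float("-inf")) + lm.get((prev_tgt, tgt), float("-inf"))

# The language model overrules the raw translation frequency:
best = max(["las", "lees"], key=lambda t: score("read", t, "gisteren"))
print(best)  # las
```

Note how “lees” is the more frequent translation in isolation, but the target-language context (“gisteren”) makes “las” the most likely combination — exactly the front-end/back-end teamwork described above.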
For me, the “fun” part of this technology is NOT the mouse-clicks, progress bars and MT connectors. Those are necessary evils to serve a broader range of users. The fun part for me is all the creative thinking about corpus selection and manipulation. It’s the precursor to the boring computer time that creates the engines.
It’s the corpus selection/preparation that enables SD to restore spacing and casing to tokenized, segmented, lower-cased text of any language, with accuracy ranging from 98% to 99.8% relative to the naturally occurring text. That’s even for difficult languages like Chinese, Japanese and Thai. By changing only the corpus selection and manipulation, the same SMT technology translates “bag of words” tasks for keyword and headline translation at eBay. Change the corpus and processing again, and it “translates” 75%-corrupted OCR output back to natural text within 99% accuracy of the original. Change it again, and the EU DGT translates complex legal works.
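The casing restoration can be sketched with the same frequency-counting mindset. This is again a toy of my own, nothing like SD's actual implementation or accuracy: learn each token's most frequent surface form from naturally cased text, then map lower-cased tokens back:

```python
from collections import Counter

# Learn each token's most frequent surface form from a cased corpus.
corpus = "Slate Desktop builds engines . I read the book .".split()
cnt = Counter(corpus)
forms = {}
for form in corpus:
    low = form.lower()
    # Keep the most frequent casing seen for each lower-cased key.
    if low not in forms or cnt[form] > cnt[forms[low]]:
        forms[low] = form

def truecase(tokens):
    """Restore the most frequent casing for each lower-cased token."""
    return [forms.get(t.lower(), t) for t in tokens]

print(" ".join(truecase("slate desktop builds engines".split())))
# Slate Desktop builds engines
```

A real truecaser also conditions on sentence position and n-gram context, which is where the corpus preparation choices start to matter.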
That diversity tells me that there are MANY more uses for this technology if a) people take the time to learn it and b) more people can use it (not a web page facade). We strive to help you re-use the works resulting from your knowledge in the most efficient way possible. Based on your sharing, I predict you’ll also enjoy and excel at the real fun part… finding creative alternatives to select and prepare the corpus. I’ll be happy to share my experiences and walk through your ideas with you about how you can re-use your work to create new tools, but you have to do the work… You own the results, and you protect your works because you don't share them with anyone.
Currently installing my 2nd copy of Slate on a spare desktop PC in my office, which I will use exclusively to build MT engines ;-)
Oops, realised my previous screenshot contained my Teamviewer ID/password. ;-)
Here's an edited version:
By the way, building on my work laptop seems not to be getting any further. It has been stuck on this since last night:
(fast i7, 32 GB RAM, 3×SSD)
Should I cancel it?
Michael, I re-posted this picture in a similar topic on our support forum to keep the knowledge/exchanges shared in one place.
Step 5 of this processing stage can take a while, but it is also the step most likely to have trouble. Let's discuss at this link: Generation time
Thanks Tom, will discuss further over there!
Please keep us informed, Michael!
H(ans van den Broek)