
Slate Desktop vs. Hans

Any day now, Slate Desktop will be released.


The big battle can then take place. Who will win?


Slate Desktop's automatic statistical techniques or my own manual statistical techniques?


My fragments list should reach 1 million lines soon. Every line has been carefully collected and optimised over the past years.


And a second question will be:


Which results will be better:


  • SD collecting all fragments on its own and then performing MT, or
  • SD using my carefully and manually collected list of fragments and then performing MT?

You are the perfect test case, since you do only one kind of text, and have collected masses of nearly identical bilingual data (very much unlike myself, sadly). I'm curious what you will find!


Michael

Hi Hans,

To spread the announcement I released earlier today: we set the release date for February 16. So, I'm eager to learn, too.

There are many ways to skin a cat. You may know Terence Lewis and his EN-NL system. It outperforms Slate Desktop (actually DoMT, but the underlying technology is Moses) within the confines of its design for Siemens work. Slate Desktop's true benefit is the reduction in the skill level necessary to create new engines for a variety of uses. For Terence to redesign his Siemens system to work in another field would be a significant undertaking. We're not here to make SD the only tool because it gives the best results. We're working towards a day when it's another vital tool that helps you work your way.

I can't answer your question about which way SD will perform better. My experience has shown the only real proof of the pudding is in the eating :) I'm happy to work with you and share my experience about non-standard ways to prepare training data. Your approach is interesting and similar to some other work we've done for clients. I assume that by "manually collected list of fragments" you are talking about the parallel data. So, don't forget about the target language model corpus. SD simply won't work without it, unless you want to perform major surgery on the underlying C++, recompile and rebuild the whole system :)

Tom

Hi Tom,


Thanks for taking the time to respond here!


>Slate Desktop's true benefit is the reduction in the skill level necessary to create new engines for a variety of uses.


Yes, I can see that. However, in my agency I constantly get new subjects. The biggest part of the work is coining the terminology. This is a very time-consuming process, and I often have to revise my terminology as I move along with the translation.


This is manual (and intellectual) labour, and it will always stay that way. There's no way that SD or any other MT system will be able to do this work for me.


So, I'll have to do it anyway. This coining, I mean. And I also define the snippets as large as possible, to improve the auto-assembling result. Here I could probably save some time and let SD identify the snippets.


However, how on earth can SD know the gender, case etc. of the nouns, in order to correctly identify snippets? It can't. It isn't rule-based. And when I start working on a new subject, I don't have a big corpus that could be used to train SD and let it identify snippets.


Etc.

Not to speak of the fun part of translation.


At least, that's how it is for me, with my very peculiar mindset and unique hardwiring (not appreciated by less tolerant persons: they really get extremely mad about my autonomous thinking; see the many kind replies that Hans van den Broek (who obviously has a different workflow, which is fine for me, but the other way around obviously isn't) has posted here whenever I have the evil courage to post some new ideas here or elsewhere).


This coining and constant puzzling, re-coining and constantly optimising my snippets/terms, constantly improving myself: that is the big fun of translating for me.


Learn new things, change my modus operandi constantly, re-invent workflows, improve. Ignore silly people. Don't let them irritate me. That's the hard part.


But dumb people with rusty patterns, not being able to renew themselves, should not keep me from improving myself and my work.


Well now, back to SD: even if it could learn the coining, it wouldn't be good for me: it would probably take away much of my professional fun.


I'm not sure. Let me test the software later this year.

Thanks for sharing your processes and personal insights into “who is Hans van den Broek.” We make no claims that SD can satisfy every business process. You say the coining “is manual... and it will always stay that way.” I tend to agree with you.


You ask, “how on earth can SD know the gender, case etc. of the nouns, in order to correctly identify snippets?” Simply put, it cannot. I started using new terminology to describe what it does do. SD (all SMT, but I'm "branding" here a little) is a concordance search engine. I was formulating the term while this group razzed Michael a few weeks ago about his work style of searching mega-concordances.


A traditional concordance has no part-of-speech or right-vs-wrong components. It’s a saved collection of the most frequent n-gram groups. How do they determine “most frequent”? At some point, someone counted everything. I propose that traditional concordances have been limited to the most frequent entries simply because the technical requirements to save/search/retrieve from an exhaustive collection are massive. Is it a coincidence that SMT has been criticized for requiring massive computing resources?
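
To make the counting idea concrete, here is a toy Python sketch of how such a frequency-limited concordance could be built. It's my own illustration, not Slate Desktop's internals; the "top" cutoff stands in for the most-frequent-entries limitation:

    from collections import Counter

    def ngrams(tokens, n):
        """Yield every contiguous n-gram in a list of tokens."""
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    def build_concordance(corpus_lines, n=3, top=1000):
        """Count all n-grams, then keep only the 'top' most frequent entries."""
        counts = Counter()
        for line in corpus_lines:
            counts.update(ngrams(line.split(), n))
        return counts.most_common(top)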


SD’s underlying SMT technology creates bilingual concordances consisting of co-occurring n-grams between the two languages. The training process counts co-occurring n-grams, then calculates and saves their frequencies. The resulting “knowledge base” contains not only 3-grams or 4-grams... it’s every grouping of n-grams from n = 1 to X (typically 7).
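
As a rough sketch of that training step, the toy Python below counts co-occurring n-grams, for every n from 1 to 7, over a sentence-aligned corpus. Real Moses training also involves word alignment and far smarter bookkeeping; this only shows the counting idea:

    from collections import Counter
    from itertools import product

    def all_ngrams(tokens, max_n=7):
        """Yield every n-gram for n = 1 .. max_n."""
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])

    def count_cooccurrences(aligned_pairs, max_n=7):
        """Count source/target n-gram pairs seen in the same segment pair."""
        cooc = Counter()
        for src, tgt in aligned_pairs:
            for pair in product(set(all_ngrams(src.split(), max_n)),
                                set(all_ngrams(tgt.split(), max_n))):
                cooc[pair] += 1
        return cooc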


The co-occurring groupings serve as context windows that enable SMT to make pseudo-“linguistic” distinctions. For example, when do you translate “read” as present vs. simple past? The distinction comes from the concordance n-gram context surrounding the source token, not from the token itself. So, the token “yesterday” within 7 grams of “read” in English tips the balance so the Dutch translation draft gets the simple past, not the present tense.
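
A contrived example of that tipping, with made-up entries and numbers (not a real phrase table):

    # "read" alone is ambiguous; "yesterday" inside the window decides the tense.
    phrase_table = {
        ("read",): [("lees", 0.55), ("las", 0.45)],
        ("read", "yesterday"): [("las", 0.97), ("lees", 0.03)],
    }

    def best_translation(source_ngram):
        candidates = phrase_table.get(source_ngram, [])
        return max(candidates, key=lambda c: c[1])[0] if candidates else None

    print(best_translation(("read",)))              # -> "lees" (present)
    print(best_translation(("read", "yesterday")))  # -> "las" (simple past)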


But that's only the front end of the engine, commonly called the “translation model.” The back end is the language model, or "style guide" as I call it. This is a concordance of all n-grams of the target language. The front and back ends work together to search for and select the most likely combination of tokens (based on frequency), using the concordance contexts (not the simple word).
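
In the classic noisy-channel formulation behind Moses, that selection is roughly "translation-model score times language-model score." A minimal sketch with made-up probabilities for two Dutch drafts:

    import math

    tm_logprob = {"ik las het boek": math.log(0.6),
                  "ik lees het boek": math.log(0.4)}   # front end: translation model
    lm_logprob = {"ik las het boek": math.log(0.02),
                  "ik lees het boek": math.log(0.01)}  # back end: language model

    # Pick the draft with the highest combined log-probability.
    best = max(tm_logprob, key=lambda c: tm_logprob[c] + lm_logprob[c])
    print(best)  # -> "ik las het boek"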


For me, the “fun” part of this technology is NOT mouse-clicks, progress bars and MT connectors. Those are necessary evils to serve a broader range of users. The fun part for me is all the creative thinking about corpus selection and manipulation. It’s the precursor to the boring computer time that creates engines.


It’s the corpus selection/preparation that enables SD to restore spacing and casing to tokenized, segmented, lower-cased text of any language, with accuracy ranging from 98% to 99.8% of the naturally occurring text. That’s even for difficult languages like Chinese, Japanese and Thai. By changing only the corpus selection and manipulation, the same SMT technology translates “bag of words” tasks for keyword and headline translation at eBay. Change the corpus and processing again, and it “translates” 75% corrupted OCR output to natural text within 99% accuracy of the original. Change it again, and the EU DGT translates complex legal works.
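
The recasing/detokenizing trick, for instance, needs nothing but a differently prepared corpus: pair each natural sentence with its own tokenized, lower-cased form and train on those pairs. A crude sketch of that corpus preparation (my own toy tokenizer, not SD's):

    import re

    def tokenize_lower(text):
        """Split punctuation off and lower-case everything."""
        return " ".join(re.findall(r"\w+|[^\w\s]", text)).lower()

    def make_recasing_pair(natural_sentence):
        """Source side: mangled text. Target side: the original."""
        return (tokenize_lower(natural_sentence), natural_sentence)

    print(make_recasing_pair("Slate Desktop ships on February 16."))
    # -> ('slate desktop ships on february 16 .', 'Slate Desktop ships on February 16.')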


That diversity tells me that there are MANY more uses for this technology if a) people take time to learn and b) more people can use it (not a web page facade). We strive to help you reuse the works resulting from your knowledge in the most efficient way possible. Based on your sharing, I predict you’ll also enjoy and excel in the real fun part… finding creative alternatives to select and prepare the corpus. I’ll be happy to share my experiences and walk through your ideas with you about how you can re-use your work to create new tools, but you have to do the work… You own the results, and you protect your works because you don't share them with anyone.


Good to see, Michael. BTW, Pieter Beens has been hammering the system with all variants of XLIFF files (SDLXLIFF, MQSXLIFF, MXLIFF and XLF). We've been updating support and fixing bugs in several wild frenzies. I'll push a new installer in a few minutes. When this engine is complete, run the new installer. It will leave your engine untouched and update with all the fixes. We owe Pieter a lot for his dedication to the testing!

Currently installing my 2nd copy of Slate on a spare desktop PC in my office, which I will use exclusively to build MT engines ;-)

At first glance, "Building two MT engines simultaneously" scared me until I read further that you're using 2 machines. This process is too resource-hungry to do it in parallel on one machine. Some sub-steps will stop when they detect another instance, but we need to add a top-level detection.

Re the two activations: we updated our licensing to allow running on 2 machines specifically for the use case you describe. That's a direct result of feedback during the Indiegogo campaign.


Oops, realised my previous screenshot contained my Teamviewer ID/password. ;-)

Here's an edited version:



By the way, the build on my work laptop seems not to be getting any further. It has been stuck on this since last night:



(fast i7, 32 GB RAM, 3×SSD)


Should I cancel it?


Michael


Michael, I re-posted this picture in a similar topic on our support forum to keep the knowledge/exchanges shared in one place.


Step 5 of this processing stage can take a while, but it is also the step most likely to have trouble. Let's discuss at this link: Generation time


Tom

Thanks Tom, will discuss further over there!


Michael

Please keep us informed, Michael!


H(ans van den Broek)
