Speed up Your Translation Processes with Across v7
The new version of the Across Language Server and the Across Translator Edition is now available! We have addressed numerous subject areas in order to improve the user-friendliness, to reduce flow times, and to enable new working styles.
Get your Across Translator Edition v7
1. More Morphers
Almost exactly two years ago, I published an article in the ATA Chronicle and ITI Bulletin that started like this:
"I've been very interested in morphology in translation technology. No, let me put that differently: I've been very frustrated that the translation environment tools we use don't offer morphology. There are some exceptions -- such as SmartCat, Star Transit, Across, and OmegaT -- that offer some morphology support. But all of them are limited to a small number of languages, and any effort to expand these would require painful and manual coding."
Later in the article I talked about how Lilt had just started using artificial intelligence to recognize morphology and make suggestions from the termbase based on that. (You can read the article right here.)
Not long afterward I started a discussion with SDL's Daniel Brockmann about their efforts to use morphology in some of their tools. This finally came out of beta at the end of last year (with Trados Studio 2019 SR 1), and I ran some tests on it last week. (In case you don't want to read the rest of this and just want to know the results: I really liked it!)
For those who would like some details:
The problem with morphology support (i.e., the ability of the underlying technology to recognize that different forms of one word all belong to one base version of that word as shown in terminology recognition, etc.) is that it's very tedious because it's language-specific. The developers of Across and Star Transit/TermStar have "solved" that problem by painstakingly coding specific rules for specific (very limited) languages. A tool like Lilt has used artificial intelligence to have the system learn morphology rules for various languages, and tools like OmegaT are using third-party tools to support "stemming" for a large number of languages.
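To make the idea concrete, here is a minimal sketch of why morphology matters for term recognition. Everything in it is hypothetical: the two-entry termbase, the hand-written table of inflected forms (using the German pairs gekauft/kaufen and geladen/laden), and the function names. Real tools derive such mappings from rules, dictionaries, or learned models rather than from a hard-coded table.

```python
# Hypothetical termbase: base form of the source term -> target term.
TERMBASE = {"kaufen": "to buy", "laden": "to load"}

# Hypothetical hand-written morphology table: inflected form -> base form.
# This stands in for whatever rule set, dictionary, or model a tool uses.
BASE_FORMS = {"gekauft": "kaufen", "kauft": "kaufen", "geladen": "laden"}


def exact_hits(tokens):
    """Term recognition without morphology: only literal matches count."""
    return [t for t in tokens if t in TERMBASE]


def morphology_hits(tokens):
    """Term recognition with morphology: map each token to its base form first."""
    hits = []
    for token in tokens:
        base = BASE_FORMS.get(token, token)
        if base in TERMBASE:
            hits.append((token, base, TERMBASE[base]))
    return hits


tokens = "die datei wurde geladen".split()
print(exact_hits(tokens))       # no hit: "geladen" is not literally in the termbase
print(morphology_hits(tokens))  # one hit: "geladen" is traced back to "laden"
```

The tedious part is the table: it has to be built separately for every language, which is exactly the language-specific effort described above.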
In my eyes, however, it's a much larger problem to use the tediousness of that task as an excuse not to do anything about it -- especially when it comes to translation environment tools that naturally need comprehensive support for basic linguistic intelligence.
SDL Trados has now finally found a way to offer morphology support for a very large array of languages within their terminology tool (presently: Albanian, Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sami, Sanskrit, Serbo-Croatian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, and Ukrainian).
Ah, you say, MultiTerm? Nah, not quite. Rather than updating the good old MultiTerm with this functionality, this is available only in the cloud-based Language Cloud offering -- which also is home to SDL's machine translation product. In fact, both services are bundled. You can get free access to 200,000 characters of machine translation and a limited cloud-based terminology product (which you are not able to share, into which you can't import Excel files, and for which you can build only one terminology database that cannot have custom termbase definitions), or you can pay $10/month for access to a less restrictive model (see here for specifics).
I asked Daniel about the decision not to implement the "linguistic search" in the desktop-based MultiTerm. Here is what he said:
"MultiTerm is our product for local file-based and server-based/on-premise terminology management. That won't change in the next few years. However, MultiTerm does not have the prerequisites to be able to be transitioned to genuine cloud-based ways of working (which I always call 'the third way of working' beyond file- and server-based). Against that background, we have decided to reinvent terminology management. This is not unlike reinventing ourselves back then for Studio following up to Translator's Workbench. The transition phase will for sure take a long time, but it is important to be prepared to reinvent a piece of technology when the previous generation has matured to its full potential. Obviously, full roundtrip of data between Language Cloud and MultiTerm will be guaranteed, so users can migrate both ways as needed."
Makes sense to me. The only thing I would add is that I'm not sure it makes sense to pay for a full-fledged version of a transitional product.
SDL is making its cloud-based terminology component morphology-aware by using Elasticsearch, a third-party product, that in turn uses the open-source Hunspell token filter (which OmegaT also uses) to enable dictionary-based stemming.
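For readers curious what dictionary-based stemming looks like on the Elasticsearch side, the following is a sketch of index settings that wire a Hunspell token filter into a custom analyzer. It is not SDL's configuration, which is not public; the names term_stemmer and term_analyzer and the choice of the de_DE locale are assumptions made for illustration. The sketch only builds and prints the settings; the matching Hunspell dictionary files would also have to be installed on the Elasticsearch node for the filter to work.

```python
import json


def hunspell_index_settings(locale):
    """Build index settings with a Hunspell-based stemming analyzer.

    "term_stemmer" and "term_analyzer" are hypothetical names chosen for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Dictionary-based stemmer; the locale selects the dictionary.
                    "term_stemmer": {"type": "hunspell", "locale": locale, "dedup": True}
                },
                "analyzer": {
                    "term_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "term_stemmer"],
                    }
                },
            }
        }
    }


print(json.dumps(hunspell_index_settings("de_DE"), indent=2))
```

Because both the stored terms and the words of the source segment pass through the same analyzer, different inflected forms can end up as the same stem and therefore match.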
While these termbases are cloud-based, they can be used within SDL Trados in exactly the same way as MultiTerm termbases, including the possibility of adding new term pairs to existing termbases right from within Studio's environment. When you select one of those termbases, you will have the choice between a "linguistic fuzzy search" (which is the one we are talking about) and a "character-based fuzzy search" (which is essentially the same kind of language-independent search based on matches of three letters or more that MultiTerm offers as well).
You can also see that you can adjust the fuzziness level for the linguistic search. (For the tests I ran, I used the default settings.)
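As an illustration of what a character-based fuzzy search with an adjustable threshold might do, here is a small sketch that compares words by the three-letter sequences they share. This is one plausible reading of "matches of three letters or more," not a description of how MultiTerm or Language Cloud actually score matches; the function names and the min_score parameter are hypothetical.

```python
def trigrams(word):
    """Set of all three-letter sequences in a word (lowercased)."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def trigram_similarity(a, b):
    """Shared trigrams divided by all trigrams of both words (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_match(a, b, min_score=0.5):
    """min_score plays the role of a fuzziness setting in this sketch."""
    return trigram_similarity(a, b) >= min_score


print(trigram_similarity("laden", "geladen"))   # 0.6
print(trigram_similarity("kaufen", "gekauft"))  # 2 shared trigrams out of 7
print(is_fuzzy_match("kaufen", "gekauft"))      # False at the 0.5 threshold
```

With these toy numbers the language-independent approach catches geladen but misses gekauft unless the threshold is lowered, and lowering it invites more false matches. That trade-off is what a linguistic search is meant to avoid.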
Years ago, I was part of a multi-university morphology project that never really happened, but for the proposal I documented in a fairly extensive manner how the old MultiTerm search feature continuously failed (back then I chose MultiTerm as the example application because it is and was the market leader -- most other tools would have delivered similar results). Fortunately, I was still able to access the same files and termbase and was therefore able to compare the output pre- and post-linguistic search. Here are some of the results (they're in German, but you don't have to understand the language to understand what's happening):
The first column shows the word in the translatable file; the second shows the terms that were present in the (then: MultiTerm) termbase but were erroneously not detected as matches; and the third shows the terms that were now detected with the identical source text and an identical termbase, only this time in the Language Cloud and with morphology support. It was not perfect, but clearly a lot better than before. And not only better, but useful. You can also see that it not only supports morphological changes (gekauft -- kaufen, geladen -- laden) but also compound words, which -- believe me! -- is especially helpful in a language like German.
The same feature is also available for term verification (you know, the previously frustrating quality check that typically turned out way more false positives than real ones).
Speaking of false positives, the tool is not perfect. In the regular translation interface, there were also false "matches," both listed in the term recognition window and highlighted in the source text as being available in the termbase without that being the case. No doubt I could have changed that by playing with the fuzziness settings, but it was not annoying enough to worry about.
So, why do I give this so much space? Because I think this is a really important feature in a really important tool, and, notably, I feel like we were listened to after having requested its implementation for years. Now it's here and we're all the better for it.
Artificial Intelligence based smart service solution to provide personalized assistance, information and instructions to support operation, maintenance, repair and diagnostics of products and to act compliant in regulated processes.
Smart Content Services powered by STAR PRISMA
2. Even More Morphers
Two tools that have not done particularly much with morphology solutions are Memsource and memoQ. I'm saying "not particularly much" because they both use a very manual and labor-intensive way of individually marking up term endings.
In memoQ you can use the asterisk (*) character in termbase entries to note that either no or any kind of characters can follow and would still be considered a match ("support*" would match "support," "supported," "supporter," etc.), and the pipe character to specify a definite set of endings ("support|ed" matches only "support" and "supported" but not "supporter" etc.).
In Memsource the pipe character can also be used, albeit in a slightly different manner. Here it has to be set following the morphological stem. In this case, "support|ed" would match "support," "supported," "supporter," etc.
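The difference between the two conventions is easier to see when each marker is turned into a matching rule. The sketch below models only the behavior described above, using regular expressions; it is not how memoQ or Memsource are implemented, and the function names are hypothetical.

```python
import re


def memoq_pattern(entry):
    """memoQ-style markers as described above: '*' allows any ending or none,
    '|' allows only the bare stem or the stem plus the listed ending."""
    if "|" in entry:
        stem, ending = entry.split("|", 1)
        return re.compile(rf"{re.escape(stem)}(?:{re.escape(ending)})?")
    if entry.endswith("*"):
        return re.compile(rf"{re.escape(entry[:-1])}\w*")
    return re.compile(re.escape(entry))


def memsource_pattern(entry):
    """Memsource-style marker as described above: '|' follows the stem,
    and anything (or nothing) may come after that stem."""
    if "|" in entry:
        stem = entry.split("|", 1)[0]
        return re.compile(rf"{re.escape(stem)}\w*")
    return re.compile(re.escape(entry))


def matches(pattern, word):
    """True only if the whole word fits the pattern."""
    return pattern.fullmatch(word) is not None


for word in ("support", "supported", "supporter"):
    print(word,
          matches(memoq_pattern("support*"), word),
          matches(memoq_pattern("support|ed"), word),
          matches(memsource_pattern("support|ed"), word))
```

The same string, "support|ed", rejects "supporter" under the memoQ reading and accepts it under the Memsource reading, which is why a termbase cannot simply be copied from one tool to the other with its markers intact.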
Again, while helpful, it's tedious to enter all those fancy characters, so Stanislav Okhvat, the person behind the great TransTools, has come up with a clever external solution for just these two tools. He calls it -- appropriately -- Term Morphology Editor, and unlike many of his other tools, this one actually costs something ($75). (May I say that I'm glad Stanislav charges something? I'm a firm believer in paying for someone else's professional efforts.)
Anyway, this tool processes memoQ or Memsource termbases in an external format (so you will have to export them out of the respective tools, process them, and then re-import them) by running them either through the Hunspell engine or, in the case of Ukrainian and Russian, also through the more powerful Morpher engine. This automatically tags all relevant terms that you then will have to accept, alter, or reject. According to those choices (which are displayed in a very easy-to-use interface), the tool will set the memoQ- or Memsource-internal morphology markers (| and/or *) so that upon reimport into the translation environment the termbases will be morphologically enabled -- at least in some languages.
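As a rough idea of what "setting the markers" involves, the sketch below takes a term plus a list of its inflected forms, finds the part that all forms share, and writes the marker after it. This is not how Term Morphology Editor works internally; the functions are hypothetical, and the inflected forms are typed in by hand here, whereas the real tool obtains its suggestions from Hunspell or Morpher and lets you accept, alter, or reject them.

```python
def common_stem(forms):
    """Longest prefix shared by every form in the list."""
    if not forms:
        return ""
    stem = forms[0]
    for form in forms[1:]:
        # Shorten the candidate until this form starts with it.
        while not form.startswith(stem):
            stem = stem[:-1]
    return stem


def mark_term(base, inflected_forms, style):
    """Return the base term with a morphology marker placed after the shared stem.

    style is "memoq" (stem followed by '*') or "memsource" (stem, '|', rest of base).
    Edge cases, such as a stem identical to the base term, are left for manual review.
    """
    stem = common_stem([base] + list(inflected_forms))
    if style == "memoq":
        return stem + "*"
    if style == "memsource":
        return stem + "|" + base[len(stem):]
    raise ValueError("style must be 'memoq' or 'memsource'")


print(mark_term("city", ["cities"], "memoq"))      # cit*
print(mark_term("city", ["cities"], "memsource"))  # cit|y
```

A very short stem such as "cit" also shows why each suggestion needs human review: it would match "citizen" and "citation" as well.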
The tool officially supports morphology suggestions for Albanian, Bulgarian, Croatian, Czech, Dutch, Estonian, French, Georgian, Italian, Kazakh, Latvian, Lithuanian, Polish, Portuguese (Brazil), Portuguese (Portugal), Romanian, Russian, Slovak, Slovenian, Spanish, Turkish, Ukrainian, and Welsh. As mentioned above, Russian and Ukrainian are über-supported because you can also use the slick Morpher service (which, by the way, does cost something if you process more than 100 terms a day).
German and other languages that use noun compounding or are agglutinative (e.g., Danish, Swedish, and Norwegian or Turkish and Finnish) are still not supported for the automatic suggestion system, and "English dictionaries will be included in future versions after I determine how to combine various available Hunspell dictionaries like Canadian, UK, US, etc."
Of course, you theoretically can use the tool for all languages (aside from bi-di and East Asian languages) to have a better way of processing your termbases manually, but I'm not sure I would pay the $75 for that. If, however, I worked in Memsource or memoQ in one or more of the fully supported languages, I would be all over this tool.
PDF Translation for Professionals
Translating PDFs is easier and quicker with TransPDF.
- Available within Memsource and memoQ
- Compatible with all CAT tools.
- Fast log-in for Proz members.
Try it FREE for your next PDF project
3. The Tech-Savvy Interpreter - Product Review: KOSS CS300-USB Headset (Column by Barry Slaughter Olsen)
One of the technology questions I am asked most frequently by interpreters is: "What headset should I buy for remote interpreting?" In October 2015, with the help of Michael Graves from ZipDX, I published "Choosing a USB Headset for Remote Interpreting," which included a list of headsets on the market back then that worked well for remote simultaneous interpretation. There have been many new headset models introduced since then. In late 2018, American headset manufacturer KOSS released a new USB headset called the CS300-USB (list price: US$49.00). It caught my attention for two reasons:
First, I've been using the KOSS Porta Pro headphones in the booth for almost 15 years. I'm now on my second pair. They are, hands down, the best sounding, most economical and highly portable headphones I've ever used. If you work on the Washington, D.C. market you will see many colleagues who swear by them, me included. They are analog headphones and have a 3.5 mm plug that works with just about any interpreter console on the market today. So when I saw KOSS had designed a new USB headset, I immediately took notice.
Second, the CS300-USB is the first ISO-compliant headset I have come across. ISO Standard 20109 (Simultaneous Interpretation - Equipment - Requirements) requires that "the microphone and the headphones shall correctly reproduce audio frequencies between 125 Hz and 15,000 Hz ± 10 dB." With the efforts underway to standardize many aspects of remote interpreting, using a headset with both ISO-compliant headphones and microphone is important, particularly when relay interpreting is involved. The CS300-USB headset far exceeds the standard, with headphones capable of reproducing between 20 Hz and 22,000 Hz and a microphone operating range of 20 Hz to 16,000 Hz.
Unlike most USB headsets on the market, the CS300-USB does not have an in-line volume control or mute button -- often referred to as a "line lump." This simplifies the operation of the headset by having the user control all volume and muting functions through the computer. Although I usually like having an in-line volume control, I don't miss the mute button at all. I've lost track of the number of times that I have forgotten to unmute the headset and then turned on the microphone on the interpreting platform only to have my audience not hear me at all. In-line mute button, goodbye and good riddance!
The headset is equipped with a flexible boom and an electret noise-cancelling microphone. The noise-cancelling is most welcome, as it helps cut down on any extraneous ambient noise such as a squeaky chair, noise from a keyboard or possible bleed from the headphones if you have the volume high. One drawback of the current design is that you can only have the microphone on the left side because the boom does not rotate 180 degrees and allow you to turn the headset around to wear the mic on the right side. This could easily be fixed in subsequent iterations of the headset.
The materials used to build the headset are of good quality. The plastics and fabrics used are flexible and comfortable. More importantly they do not make noise when you talk or adjust the headset. Some other headsets I have tested can actually squeak and creak on your head as you move your jaw to talk. These noises can be picked up by the microphone and can be very distracting for your listeners.
The rectangular headphones are comfortable and cushy and rest on my ears well. The headset also has an extra-long eight-foot cord.
I've been using the CS300-USB for three months now. It has become my USB headset of choice in my interpreting studio. It doesn't really travel well, but it wasn't designed to. (Hint to KOSS: perhaps you could make a USB version of the Porta Pro?) So, for now, I carry another USB headset for use with my laptop when I'm on the road and suitcase space is limited. All in all, this is a comfortable, reliable, easy-to-use USB headset. If you are looking for a USB headset for remote interpreting or even if you just need one for web conferencing, this one is at the top of the list.
Do you have a question about a specific technology? Or would you like to learn more about a specific interpreting platform, interpreter console or supporting technology? Send us an email at firstname.lastname@example.org.
Imagine all the world's best dictionaries at your fingertips!
For a fixed monthly fee, on all your devices, integrated in your daily work applications!
WordFinder Unlimited - One service, 5 applications and more than 300 dictionaries in 26 languages.
WordFinder Unlimited was voted the winner of the Process Innovation Challenge at LocWorld in Warsaw 2018.
Click here to find out 10 things you need to know about WordFinder Unlimited
4. Using Neural Machine Translation Beyond Post-Editing -- A Conversation
In the past I have conducted a number of back-and-forth email-based conversations with experts on topics that are interesting and useful to me and, hopefully, to the community at large. The following conversation turned out to be very useful as well, but it was not conducted as straightforwardly as some of the others. Why? Well, it turned out that my discussion partner and I made certain assumptions as we communicated that the other either did not understand right away or that were muddied by our own preconceived ideas. In practice, this meant that we went back and forth a number of times to amend our questions and answers. It also meant that both of us realized this is a "problem" not just for Félix and me; instead, it might be a symptom of many discussions, whether between the machine translation development community and translators, or even between translators with different specializations and different language combinations where the needs, tools, and language requirements demand different solutions.
Since this super-interesting conversation turned out to be longer than both of us expected, I will have the first part in this edition of the Tool Box Journal and the next in the following edition (which should come in no time since this one is so late).
JOST: Today I'm starting a conversation with Félix do Carmo, a translator and now also machine translation researcher, about better usability practices for the professional use of machine translation. Félix, do you want to first briefly introduce yourself?
FELIX: I graduated in 1992, with a language degree that included a two-year specialization in Translation. Two years later, a few colleagues and I opened TIPS, a translation company specialized in Portuguese, and I took the role of managing director. I enrolled in a Masters in Translation Studies that the University of Porto offered, and after finishing it in 1998, I started teaching Translation Technologies to university students and teachers. In 2010, I took the opportunity to work on a PhD, which allowed me to learn and collaborate with computer scientists and get to know the insides of Machine Translation. My project, which I finished in 2017, focused on studying how to describe and support post-editing. And suddenly the opportunity came for a fellowship in the ADAPT Centre, in Dublin City University, which allowed me to work as a researcher with people like Joss Moorkens, Dorothy Kenny, and Andy Way, and to try to influence MT researchers to develop tools for translators, rather than autonomous devices. So, although I am not producing translated and revised words, I see myself as a translator, playing different roles in the world of translation, and taking all the opportunities I can to learn as much as possible about my profession.
JOST: Like you I have also been interested in working with machine translation, not so much from the angle of traditional post-editing but more in using machine translation suggestions as one of a number of data sources to help translators in the translation process. I've been particularly eager to find good ways to use translation environment tools to semi-automatically use partial data from machine-translated segments. That certainly seemed to be a good way when working with statistical machine translation. I wonder what difference it makes that we now (typically) use neural machine translation. Are the results of neural machine translation usable in the same way as the results of statistical machine translation?
FELIX: That's an interesting question. To answer it, we probably need to start with some technical information about the different systems. In statistical machine translation (SMT), the decoder which "translates" is essentially a search algorithm. For each word and group of words in the new source sentence, it consults the phrase table, which contains aligned words and groups of words from the training data, and it extracts the best equivalent. So, the approach is paradigmatic: each source word creates a slot, which may be filled in by any word in the phrase table. The search algorithm looks for the best fit for that slot, as if it was looking for LEGO pieces, slotted into position in a vertical, top-down movement, so the resulting sentences sometimes are awkward, with syntactical errors, and elements that do not go well together.
Neural machine translation (NMT) decoders work differently. The decoder does not search for LEGO pieces from tables of aligned phrases. Instead, it uses neural networks first to learn and then to identify the best sequences to translate full sentences. This is done from the mathematical representations of the sentences it learned from big amounts of parallel data, and this mathematical data is only converted into words in the last stage of composing the translation. NMT tries to construct a sequence horizontally, linearly, not from the top down, but beginning to end, with each sequence of previous words determining the next word. So, it is as if the system works syntagmatically: first, it learns the design of the puzzle, and only later it knows which pieces form that design. That focus on the sequence, the syntagmatic view of language, is what makes NMT more fluent than SMT, since the elements that compose a target sentence are more tightly knit together.
So, when you and I think of MT output being disassembled into pieces which may be fed separately to a translator, we are thinking in terms of the SMT models, but this does not describe what happens in typical NMT models. NMT is not conceived to output partial data, but whole sequences.
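The contrast between filling slots and building a sequence can be shown with a deliberately crude toy. The sketch below is not a statistical or neural decoder: the phrase table, the next-word scores, and both function names are invented for illustration, and all numbers are made up. It only shows how choosing each word independently differs from choosing each word in light of the word before it.

```python
# Hypothetical toy data: translation options per source word, with made-up scores.
PHRASE_TABLE = {
    "zu": {"to": 0.6, "at": 0.4},
    "hause": {"house": 0.6, "home": 0.4},
}

# Hypothetical scores for a target word given the previous target word.
NEXT_WORD = {
    ("<s>", "to"): 0.2, ("<s>", "at"): 0.8,
    ("to", "house"): 0.3, ("to", "home"): 0.1,
    ("at", "house"): 0.1, ("at", "home"): 0.9,
}


def fill_slots(source):
    """Slot view: pick the best-scoring option for each source word independently."""
    return [max(PHRASE_TABLE[w], key=PHRASE_TABLE[w].get) for w in source]


def generate_left_to_right(source):
    """Sequence view: each choice also depends on the word produced just before."""
    output, previous = [], "<s>"
    for word in source:
        options = PHRASE_TABLE[word]
        best = max(options, key=lambda t: options[t] * NEXT_WORD.get((previous, t), 0.0))
        output.append(best)
        previous = best
    return output


print(fill_slots(["zu", "hause"]))              # ['to', 'house']
print(generate_left_to_right(["zu", "hause"]))  # ['at', 'home']
```

Real systems are far more sophisticated on both sides, but the toy shows where the fluency comes from: the second function never considers a word in isolation.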
JOST: But SMT was not conceived to be used that way either, it just happened to be generated that way. Wouldn't it make sense to say that a translation suggestion that comes from a neural engine has valuable parts as well, no matter whether the whole sentence sounds more fluent as a whole? And also independent of whether the suggestion was put in there as parts (as in SMT) or in the sequential manner you're describing for NMT? If that is so, then I don't completely understand why an automated fragment search does not make sense when working with NMT. But I'm very interested in what we can do with the machine translation suggestions once the (non-interactive) MT engine has "done its job" and presented the suggestion within the translation environment. Technologically speaking, now it's the task of the translation environment tool to present the usable parts of the suggestion. Speaking from a workflow perspective, this typically means that it's the translator's keystrokes that enable the tool to present suitable fragments.
FELIX: You are right: If we start from the point of already having full suggestions and we want to know how to extract information from them, then we should not be discussing NMT and whether it fundamentally affects this process of choosing the best solutions. Like you say, it is no longer the task of the MT engine but that of the translation environment tool to present the words that you want to use from the full suggestions it receives.
This again is a search problem, and there are many approaches for these complex problems. The sheer nature of linguistic data, so variable as we know it is, makes searching linguistic items an even harder problem than usual. You suggest that typed keystrokes should bring up the correct suggestions from the different sources you have. But can you be sure the full suggestions from MT engines contain the words you want to write? For example, you may have several synonyms in two or three different suggestions, but not the one you are looking for. So, it is probably not enough for the algorithm to do a simple search in these suggestions, and it will need to look in other sources (monolingual data perhaps) for the word you are typing. But under which conditions or rules should this search be done, for it to be effective and efficient?
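A simple version of the keystroke-driven search discussed here can be sketched in a few lines. The function name and the sample suggestions are hypothetical, and the sketch does no more than a prefix lookup over the words of the suggestions that are already on screen, which is the limitation raised above.

```python
from collections import Counter


def harvest(prefix, suggestions):
    """Words from the MT suggestions that start with what the translator has typed,
    most frequent first."""
    prefix = prefix.lower()
    counts = Counter()
    for suggestion in suggestions:
        for raw in suggestion.split():
            word = raw.strip(".,;:!?").lower()
            if word and word.startswith(prefix):
                counts[word] += 1
    return [word for word, _ in counts.most_common()]


# Hypothetical suggestions from three different engines for the same segment.
suggestions = [
    "The device must be switched off.",
    "The appliance has to be turned off.",
    "The device has to be turned off.",
]

print(harvest("de", suggestions))  # ['device']
print(harvest("un", suggestions))  # []
```

The second call is the problem case: a translator who wants to write "unit" gets nothing, because none of the suggestions contains the word, no matter how many engines contributed.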
JOST: I agree that the current search mechanisms that are based on keystrokes are not advanced. There are no fuzzy features, or there is certainly no linguistically-driven search for synonyms or the like, but maybe that's not even what's needed. After all, the translator may not want to see a fuzzy match or a synonym if they have already decided to go with a certain term. What I take from this, though, is that there is no real difference in "harvesting" fragments from already-generated MT suggestions, no matter whether they come from SMT or NMT.
Let me ask you this: What other kind of developments or maybe under-researched areas are you looking at that would make NMT useful beyond "just" post-editing it?
FELIX: Let us think about the current scenario in the translator's desktop, in which, as you say, machine translation suggestions are "one of a number of data sources to help translators in the translation process." Although new sources of data bring new solutions, they also bring new problems. We may say that the impact of NMT in the translator's desktop is still globally under-researched. Let me discuss a few examples of issues that are not currently being researched enough.
NMT still requires very big amounts of training data, resorting to more data than TMs usually hold. This means that NMT will always present hypotheses which will create new conflicts with translators' local resources. Although research says that NMT produces "better output," this definition of quality is usually measured in isolated and simulated scenarios. We need different evaluation factors and metrics to understand how useful "better output" actually is in real scenarios.
For us to discuss how we can move "beyond post-editing," and thus to help translators develop new ways of working, we need to talk about the translation process itself. I believe there is still too much fog created by the introduction of the term "post-editing" in the industry, and we need to take a step back and try to get a clear view of what we call translation and what we call post-editing. Let me try to briefly express my view on this.
If your system feeds fragments of suggestions to the translator so he can write the translation, what the translator is doing is actually translating, not post-editing. The translator has to generate the translation in his mind before he chooses to accept or change each word or phrase that is being presented dynamically to him. That is why we talk about a high cognitive load in this process, because the translator's thoughts are constantly being interrupted by the support system. Most of these systems are known as "Interactive Machine Translation," but I would call the process "Interactive Human Translation," because the resulting translation comes from that mental process. There isn't enough research on these cognitive loads and the effects of such things as increased productivity in a regular work life.
Post-editing, on the other hand, essentially involves editing, which is only possible when the translator is presented with a full suggestion by the MT system that is good enough for him to read and, instead of thinking about a full translation alternative, identify parts of the suggestion that require editing. In a PE project, a translator edits some sentences, but he may also need to translate quite a few. So, in PE you have not only editing, but also translating. The threshold at which a translator is no longer editing but actually translating is another under-researched area I am interested in.
But let us not fool ourselves into thinking that when we talk about editing we are talking about a simpler and easier-to-learn-and-automate task. If we go back to the paradigmatic and syntagmatic approaches -- one identifying slots and filling them in, the other more concerned with the relations in a sequence -- even editing involves those two dimensions in a very difficult-to-predict decision process. Editing may be broken down into four actions: deleting, inserting, replacing, and moving words. Only replacing is "simply" paradigmatic: you identify a slot which is occupied by the wrong word and you replace it. Moving a word is a good example of a syntagmatic action, because you mess with the structure your MT suggestion constructed. And estimating these actions is not easy: it has been demonstrated that estimating all options of new positions of an element in a sequence is one of the hardest mathematical problems you can ask a computer to do. Again, more research is needed into the patterns of editing, and on how to create assistants that support these processes.
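The four editing actions can be made visible by comparing an MT suggestion with the edited result word by word. The sketch below uses Python's standard difflib module, which reports deletions, insertions, and replacements; it has no notion of a move, so the sketch simply flags a word as moved when it was deleted in one place and inserted in another. That shortcut and the function name are assumptions made for this illustration, not an established method for classifying edits.

```python
from difflib import SequenceMatcher


def edit_actions(mt_output, edited):
    """List the word-level edits between an MT suggestion and its edited version.

    Returns (actions, moved): actions are (tag, old_words, new_words) tuples with
    tag in {"delete", "insert", "replace"}; moved lists words that were deleted in
    one position and inserted in another.
    """
    old, new = mt_output.split(), edited.split()
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    actions = [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    deleted = {w for tag, words, _ in actions if tag == "delete" for w in words}
    inserted = {w for tag, _, words in actions if tag == "insert" for w in words}
    return actions, sorted(deleted & inserted)


print(edit_actions("the big house", "the large house"))
print(edit_actions("he quickly closed the door", "he closed the door quickly"))
```

In the first call the change shows up as a single replacement. In the second, the comparison sees one deletion and one insertion of "quickly," and only the extra step recognizes that the two belong together as a move. Even this small example hints at why moves are harder to model than replacements.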
JOST: That's really interesting -- but really theoretical. What's being done in academia with NMT in a more practical manner to move beyond "post-editing," as vague as that term might be?
You can see Félix's answer to this and other questions in the next Tool Box Journal.
March Sale! 35% off SDL Trados Studio 2019 Freelance
Get 35% off SDL Trados Studio 2019 Freelance in our March Mayhem Sale. If you're looking to upgrade to SDL Trados Studio 2019, you can get 30% off upgrades.
Don't miss out, these special offers end on March 31. Get your special offer here >>
5. New Password for the Tool Box Archive
As a subscriber to the Premium version of this journal you have access to an archive of Premium journals going back to 2007.
You can access the archive. This month the user name is toolbox and the password is xuanzang.
New user names and passwords will be announced in future journals.
The Last Word on the Tool Box Journal
If you would like to promote this journal by placing a link on your website, I will in turn mention your website in a future edition of the Tool Box Journal. Just paste the code you find
into the HTML code of your webpage, and the little icon that is displayed on that page with a link to my website will be displayed.
If you are subscribed to this journal with more than one email address, it would be great if you could unsubscribe redundant addresses through the links Constant Contact offers below.
Should you be interested in reprinting one of the articles in this journal for promotional purposes, please contact me for information about pricing.
© 2019 International Writers' Group