Jump to Content
Wolfgang Macherey

Wolfgang Macherey

Wolfgang Macherey joined Google in 2006 as a research scientist, where he works in the machine translation group with Franz Josef Och. He has been working on natural language processing since 1996.

Wolfgang worked as a Research Assistant at RWTH Aachen University from 1999 to 2005. His main research interests are in statistical machine translation and automatic speech recognition with the focus on discriminative training methods, natural language processing, statistical pattern recognition, and machine learning.

He received a PhD in Computer Science from RWTH Aachen University, Germany, in 2010 and his Diploma Degree in Computer Science from RWTH Aachen University in 1999 with a major in statistical pattern recognition and a minor in physical chemistry and thermodynamics.

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract We present Mu2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on un-labeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition(ASR), Automatic Speech Translation (AST)and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu2SLAM trains ona sequence-to-sequence masked denoising objective similar to T5 on both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoSTAST, Mu2SLAM establishes a new state-of-the-art for models trained on public datasets, improv-ing on xx-en translation over the previous best by 1.9 Bleu points and on en-xx translation by 0.9 Bleu points. On Voxpopuli ASR, our model matches the performance of a mSLAM model finetuned with a RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks. View details
    Preview abstract In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%. View details
    Preview abstract In this paper we share findings from our effort towards building practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results across three research domains: (i) Building clean, web-mined datasets by leveraging semi-supervised pre-training for language-id and developing data-driven filtering techniques; (ii) Leveraging massively multilingual MT models trained with supervised parallel data for over $100$ languages and small monolingual datasets for over $1000$ languages to enable translation for several previously under-studied languages; and (iii) Studying the limitations of evaluation metrics for long tail languages and conducting qualitative analysis of the outputs from our MT models. We hope that our work provides useful insights to practitioners working towards building MT systems for long tail languages, and highlights research directions that can complement the weaknesses of massively multilingual pre-trained models in data-sparse settings. View details
    Preview abstract Multilingual neural machine translation (NMT) typically learns to maximize the likelihood of training examples from a combination set of multiple language pairs. However, this mechanical combination only relies on the basic sharing to learn the inductive bias, which undermines the generalization and transferability of multilingual NMT models. In this paper, we introduce a multilingual crossover encoder-decoder (mXEnDec) to fuse language pairs at instance level to exploit cross-lingual signals. For better fusions on multilingual data, we propose several techniques to deal with the language interpolation, dissimilar language fusion and heavy data imbalance. Experimental results on a large-scale WMT multilingual data set show that our approach significantly improves model performance on general multilingual test sets and the model transferability on zero-shot test sets (up to $+5.53$ BLEU). Results on noisy inputs demonstrates the capability of our approach to improve model robustness against the code-switching noise. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level. View details
    Preview abstract Recently, self-supervised pre-training of text representations has been success-fully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve dramatic success on resource-rich NMT. In this paper, we propose a joint training approach, F2-XEnDec, to jointly self-supervised and supervised train NMT models. To this end, a new task called crossover encoder-decoder (XEnDec) is designed to entangle their representations. The key idea is to combine pseudo parallel sentences (also generated byXEnDec)) used in self-supervised training and parallel sentences in supervised training through a second crossover. Experiments on two resource-rich translation benchmarks, WMT’14English-German and English-French, demonstrate our approach achieve substantial improvements over the Transformer. We also show that our approach is capable of improving the model robustness against input perturbations, in particular for code-switched perturbations. View details
    Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
    David Grangier
    Viresh Ratnakar
    Qijun Tan
    Transactions of the Association for Computational Linguistics, vol. 9, pp. 1460-1474
    Preview abstract Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research. View details
    Preview abstract We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found in the candidate vs. those found in the source. Our approach achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references, which is the largest number of wins for a single evaluation method on this task. On 4 language pairs, we also achieve higher correlation with human judgements than BLEU. To foster further research, we release a dataset containing 1.8 million grounded entity mentions across 18 language pairs from the WMT19 metrics track data. View details
    Preview abstract In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, in which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. We then discuss our approach, AdvAug, to train NMT models using the embeddings of virtual sentences in sequence-tosequence learning. Experiments on ChineseEnglish, English-French, and English-German translation benchmarks show that AdvAug achieves significant improvements over the Transformer (up to 4.9 BLEU points), and substantially outperforms other data augmentation techniques (e.g. back-translation) without using extra corpora. View details
    Preview abstract We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them. View details
    Preview abstract There has been great progress in improving streaming machine translation, a simultaneous paradigm where the system appends to a growing hypothesis as more source content becomes available. We study a related problem in which revisions to the hypothesis beyond strictly appending words are permitted. This is suitable for applications such as live captioning an audio feed. In this setting, we compare custom streaming approaches to re-translation, a straightforward strategy where each new source token triggers a distinct translation from scratch. We find re-translation to be as good or better than state-of-the-art streaming systems, even when operating under constraints that allow very few revisions. We attribute much of this success to a previously proposed data-augmentation technique that adds prefix-pairs to the training data, which alongside wait-k inference forms a strong baseline for streaming translation. We also highlight re-translation's ability to wrap arbitrarily powerful MT systems with an experiment showing large improvements from an upgrade to its base model. View details
    Preview abstract End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-totranscript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study. View details
    Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
    Naveen Ari
    Semih Yavuz
    Ruoming Pang
    Wei Li
    Colin Raffel
    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, Florence, Italy (2019), pp. 1313-1323
    Preview abstract Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic Infinite Lookback (MILk) attention, which maintains both a hard,monotonic attention head to schedule the read-ing of the source sentence, and a soft attention head that extends from the monotonic head back to the beginning of the source. We show that MILk’s adaptive schedule allows it to arrive at latency-quality trade-offs that are favorable to those of a recently proposed wait-k strategy for many latency values. View details
    Preview abstract Neural machine translation (NMT) suffers from the vulnerability to noisy perturbations in the input, which can cause a model trained on the clean data to behave abnormally on the noisy input. We propose an approach to improving the robustness of NMT models, which consists of two parts: (1) attack the translation model with adversarial source examples; (2) defend the translation model with adversarial target input to be robust against adversarial source input. For the generation of adversarial input, we propose to use a gradient-based method to craft adversarial examples that are advised by the translation loss in NMT based on the clean input. Experimental results on Chinese-English and English-German translation tasks demonstrate that our approach achieves significant improvements on the standard clean data and performs robustness on the noisy data. View details
    Preview abstract We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task. View details
    Preview abstract We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to the quality and practicality towards universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research. View details
    Preview abstract Translating characters instead of words or word-fragments has the potential to simplify the processing pipeline for neural machine translation (NMT), and improve results by eliminating hyper-parameters and manual feature engineering. However, it results in longer sequences in which each symbol contains less information, creating both modeling and computational challenges. In this paper, we show that the modeling problem can be solved by standard sequence-to-sequence architectures of sufficient depth, and that deep models operating at the character level outperform identical models operating over word fragments. This result implies that alternative architectures for handling character input are better viewed as methods for reducing computation time than as improved ways of modeling longer sequences. From this perspective, we evaluate several techniques for character-level NMT, verify that they do not match the performance of our deep character baseline model, and evaluate the performance versus computation time tradeoffs they offer. Within this framework, we also perform the first evaluation for NMT of conditional computation over time, in which the model learns which timesteps can be skipped, rather than having them be dictated by a fixed schedule specified before training begins. View details
    Preview abstract The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first out-performed by the convolutional seq2seq model, which was then out-performed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture accompanied by a set of modeling and training techniques that are in principle applicable to other seq2seq architectures. In this paper, we tease apart the new architectures and their accompanying techniques in two ways. First, we identify several key modeling and training techniques, and apply them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT'14 English to French and English to German tasks. Second, we analyze the properties of each fundamental seq2seq architecture and devise new hybrid architectures intended to combine their strengths. Our hybrid models obtain further improvements, outperforming the RNMT+ model on both benchmark datasets. View details
    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
    Mike Schuster
    Mohammad Norouzi
    Maxim Krikun
    Qin Gao
    Apurva Shah
    Xiaobing Liu
    Łukasz Kaiser
    Stephan Gouws
    Taku Kudo
    Keith Stevens
    George Kurian
    Nishant Patil
    Wei Wang
    Jason Smith
    Alex Rudnick
    Macduff Hughes
    CoRR, vol. abs/1609.08144 (2016)
    Preview abstract Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system. View details
    Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
    Chris Dyer
    Franz Och
    Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, ACL and AFNLP (2009), pp. 163-171
    Lattice-based Minimum Error Rate Training for Statistical Machine Translation
    Franz Och
    Ignacio Thayer
    Jakob Uszkoreit
    Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 725-734
    Preview abstract Minimum Error Rate Training (MERT) is an effective means to estimate the feature function weights of a linear model such that an automated evaluation criterion for measuring system performance can directly be optimized in training. To accomplish this, the training procedure determines for each feature function its exact error surface on a given set of candidate translations. The feature function weights are then adjusted by traversing the error surface combined over all sentences and picking those values for which the resulting error count reaches a minimum. Typically, candidates in MERT are represented as N-best lists which contain the N most probable translation hypotheses produced by a decoder. In this paper, we present a novel algorithm that allows for efficiently constructing and representing the exact error surface of all translations that are encoded in a phrase lattice. Compared to N-best MERT, the number of candidate translations thus taken into account increases by several orders of magnitudes. The proposed method is used to train the feature function weights of a phrase-based statistical machine translation system. Experiments conducted on the NIST 2008 translation tasks show significant runtime improvements and moderate BLEU score gains over N-best MERT. View details
    Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation
    Franz Och
    Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 620-629
    Preview abstract We present Minimum Bayes-Risk (MBR) decoding over translation lattices that compactly encode a huge number of translation hypotheses. We describe conditions on the loss function that will enable efficient implementation of MBR decoders on lattices. We introduce an approximation to the BLEU score~\cite{papineni01} that satisfies these conditions. The MBR decoding under this approximate BLEU is realized using Weighted Finite State Automata. Our experiments show that the Lattice MBR decoder yields moderate, consistent gains in translation performance over N-best MBR decoding on Arabic-to-English, Chinese-to-English and English-to-Chinese translation tasks. We conduct a range of experiments to understand why Lattice MBR improves upon N-best MBR and also study the impact of various parameters on MBR performance. View details
    Improving Word Alignment with Bridge Languages
    Franz Och
    Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 209 N. Eighth Street, East Stroudsburg, PA, USA (2007)
    Preview abstract We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for combining word alignment systems from multiple bridge languages. The final translation is obtained by consensus decoding that combines hypotheses obtained using all bridge language word alignments. We present experiments showing that multilingual, parallel text in Spanish, French, Russian, and Chinese can be utilized in this framework to improve translation performance on an Arabic-to-English task. View details
    An Empirical Study on Computing Consensus Translations from Multiple Machine Translation Systems
    Franz J. Och
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Association for Computational Linguistics, 209 N. Eighth Street, East Stroudsburg, PA, USA, pp. 986-995
    Preview abstract This paper presents an empirical study on how different selections of input translation systems affect translation quality in system combination. We give empirical evidence that the systems to be combined should be of similar quality and need to be almost uncorrelated in order to be beneficial for system combination. Experimental results are presented for composite translations computed from large numbers of different research systems as well as a set of translation systems derived from one of the best-ranked machine translation engines in the 2006 NIST machine translation evaluation. View details
    Minimum Exact Word Error Training
    Ralf Schlueter
    Hermann Ney
    Automatic Speech Recognition and Understanding (2005), pp. 186-190
    Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition
    Lars Haferkamp
    Ralf Schlueter
    Hermann Ney
    Europ. Conf. on Speech Communication and Technology (2005), pp. 2133-2136
    Discriminative Training with Tied Covariance Matrices
    Ralf Schlueter
    Hermann Ney
    8th Int. Conf. on Spoken Language Processing (ICSLP) (2004), pp. 681-684
    Adaptation in Statistical Pattern Recognition Using Tangent Vectors
    Hermann Ney
    Joerg Dahmen
    IEEE Trans. Pattern Analysis Machine Intelligence, vol. 26 (2004), pp. 269-274
    A Comparative Study on Maximum Entropy and Discriminative Training for Acoustic Modeling in Automatic Speech Recognition
    Hermann Ney
    Proc. European Conference on Speech Communication and Technology (2003), pp. 493-496
    Probabilistic Aspects in Spoken Document Retrieval
    Joerg Viechtbauer
    Hermann Ney
    EURASIP Journal on Applied Signal Processing (2003), pp. 115-127
    Probabilistic Retrieval Based On Document Representations
    Joerg Viechtbauer
    Hermann Ney
    Int. Conf. on Spoken Language Processing (2002), pp. 1481-1484
    Towards Automatic Corpus Preparation for a German Broadcast News Transcription System
    Hermann Ney
    Int. Conf. on Spoken Language Processing (2002), pp. 733-736
    Improving Automatic Speech Recognition Using Tangent Distance
    Joerg Dahmen
    Hermann Ney
    European Conference on Speech Communication and Technology (2001)
    Comparison of Discriminative Training Criteria and Optimization Methods for Speech Recognition
    Ralf Schlueter
    Boris Mueller
    Hermann Ney
    Speech Communication, vol. 34 (2001), pp. 287-310
    Learning of Variability for Invariant Statistical Pattern Recognition
    Joerg Dahmen
    Hermann Ney
    European Conference on Machine Learning (ECML) (2001), pp. 263-275
    A Combined Maximum Mutual Information and Maximum Likelihood Approach for Mixture Density Splitting
    Ralf Schlueter
    Boris Mueller
    Hermann Ney
    Europ. Conf. on Speech Communication and Technology (1999), pp. 1715-1718
    Comparison of Discriminative Training Criteria
    Ralf Schlueter
    Int. Conf. on Acoustics, Speech, and Signal Processing (1998), pp. 493-496
    Implementierung und Vergleich diskriminativer Verfahren fuer Spracherkennung bei kleinem Vokabular
    RWTH Aachen University (1998), pp. 123
    Comparison of Optimization Methods for Discriminative Training Criteria
    Ralf Schlueter
    Stephan Kanthak
    Hermann Ney
    Lutz Welling
    Europ. Conf. on Speech Communication and Technology (1997), pp. 15-18