Shumeet Baluja

Shumeet Baluja, Ph.D., is currently a Senior Staff Research Scientist at Google, where he works on a broad set of topics ranging from image processing and machine learning to wireless application development and user interaction measurement. He is the author of the forthcoming book The Silicon Jungle: A Novel of Deception, Power, and Internet Intrigue.

Shumeet was formerly the Chief Technology Officer (CTO) of JAMDAT Mobile, Inc., where he oversaw all aspects of technology initiation, development and deployment. Previously, Shumeet was Chief Scientist at Lycos Inc., where he led the Research and Applied Technology Group in the quantitative and qualitative analysis of user behavior, including data mining and trend analysis, advertisement optimization, and statistical modeling of traffic and site interactions. As Senior Vice President of R&D at eCompanies LLC, he spearheaded the creation of their wireless practice and was responsible for finding and evaluating new technologies and technology entrepreneurs.

Shumeet has filed numerous patents and has published scientific papers in fields including computer vision and facial image processing, advertisement display and optimization, automated vehicle control, statistical machine learning, and high-dimensional optimization. Shumeet graduated from the University of Virginia with a Bachelor of Science, with high distinction, in 1991. He received his Ph.D. in computer science from Carnegie Mellon University in 1996.

Authored Publications
    Hand-drawn doodles present an interesting and difficult set of textures to model and synthesize. Unlike the typical natural images that are most often used in texture synthesis studies, the doodles examined here are characterized by the use of sharp, irregular, and imperfectly scribbled patterns, frequent imprecise strokes, haphazardly connected edges, and randomly or spatially shifting themes. The almost binary nature of the doodles examined makes it difficult to hide common mistakes such as discontinuities. Further, there is no color or shading to mask flaws and repetition; any process that relies on region copying, even stochastic copying, is readily discernible. To tackle the problem of synthesizing these textures, we model the underlying generation process of the doodle, taking into account potential unseen, but related, expansion contexts. We demonstrate how to generate infinitely long textures, such that the texture can be extended far beyond a single image's source material. This is accomplished by creating a novel learning mechanism that is taught to condition the generation process on its own generated context (what was generated in previous steps), not just upon the original material.
    Not All Network Weights Need to Be Free
    Michele Covell
    21st IEEE International Conference on Machine Learning and Applications, ICMLA 2022, Bahamas, December 12-14, 2022
    As state-of-the-art network models routinely grow to architectures with billions and even trillions of learnable parameters, the need to efficiently store and retrieve these models into working memory becomes a more pronounced bottleneck. This is felt most severely in efforts to port models to personal devices, such as consumer cell phones, which now commonly include GPU and TPU processors designed to handle the enormous computational burdens associated with deep networks. In this paper, we present novel techniques for dramatically reducing the number of free parameters in deep network models with the explicit goals of (1) model compression with little or no model decompression overhead at inference time and (2) reducing the number of free parameters in arbitrary models without requiring any modifications to the architecture. We examine four techniques that build on each other, and provide insight into when and how each technique operates. Accuracy as a function of free parameters is measured on two very different deep networks: ResNet and Vision Transformer. On the latter, we find that we can reduce the number of parameters by 20% with no loss in accuracy.
    Adding Non-linear Context to Deep Networks
    Michele Covell
    IEEE International Conference on Image Processing (2022)
    Visualizing Semantic Walks
    NeurIPS-2022 Workshop on Machine Learning for Creativity and Design, https://neuripscreativityworkshop.github.io/2022/
    An embedding space trained from both a large language model and vision model contains semantic aspects of both and provides connections between words, images, concepts, and styles. This paper visualizes characteristics and relationships in this semantic space. We traverse multi-step paths in a derived semantic graph to reveal hidden connections created from the immense amount of data used to create these models. We specifically examine these relationships in the domain of painters, their styles, and their subjects. Additionally, we present a novel, non-linear sampling technique to create informative visualizations of semantic graph transitions.
    Numerous methods have been proposed to transform color and grayscale images to their single bit-per-pixel binary counterparts. Commonly, the goal is to enhance specific attributes of the original image to make it more amenable to analysis. However, when the resulting binarized image is intended for human viewing, aesthetics must also be considered. Binarization techniques, such as half-toning, stippling, and hatching, have been widely used for modeling the original image's intensity profile. We present an automated method to transform an image to a set of binary textures that represent not only the intensities, but also the colors of the original. The foundation of our method is information preservation: creating a set of textures that allows for the reconstruction of the original image's colors solely from the binarized representation. We present techniques to ensure that the textures created are not visually distracting, preserve the intensity profile of the images, and are natural in that they map sets of colors that are perceptually similar to patterns that are similar. The approach uses deep neural networks and is entirely self-supervised; no examples of good vs. bad binarizations are required. The system yields aesthetically pleasing binary images when tested on a variety of image sources.
    Contextual Convolution Blocks
    Proceedings of the British Machine Vision Conference 2021 (2021)
    A fundamental processing layer of modern deep neural networks is the 2D convolution. It applies a filter uniformly across the input, effectively creating feature detectors that are translation invariant. In contrast, fully-connected layers are spatially selective, allowing unique detectors across the input. However, full connectivity comes at the expense of an enormous number of free parameters to be trained, the associated difficulty in learning without over-fitting, and the loss of spatial coherence. We introduce Contextual Convolution Blocks, a novel method to create spatially selective feature detectors that are locally translation invariant. This increases the expressive power of the network beyond standard convolutional layers and allows learning unique filters for distinct regions of the input. The filters no longer need to be discriminative in regions not likely to contain the target features. This is a generalization of the Squeeze-and-Excitation architecture that introduces minimal extra parameters. We provide experimental results on three datasets and a thorough exploration into how the increased expressiveness is instantiated.
    Interpretable Actions: Controlling Experts with Understandable Commands
    Michele Covell
    The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), AAAI (2021)
    Despite the prevalence of deep neural networks, their single most cited drawback is that, even when successful, their operations are inscrutable. For many applications, the desired outputs are the composition of externally-defined bases. For such decomposable domains, we present a two-stage learning procedure producing combinations of the external bases which are trivially extractable from the network. In the first stage, the set of external bases that will form the solution are modeled as differentiable generator modules, controlled by the same parameters as the external bases. In the second stage, a controller network is created that selects parameters for those generators, either successively or in parallel, to compose the final solution. Through three tasks, we concretely demonstrate how our system yields readily understandable commands. In one, we introduce a new form of artistic style transfer, learning to draw and color with crayons, in which the transformation of a photograph or painting occurs not as a single monolithic computation, but by the composition of thousands of individual, visualizable strokes. The other two tasks, single-pass function approximation with arbitrary bases and shape-based synthesis, show how our approach produces understandable and extractable actions in two disparate domains.
    Seamless Audio Melding: Using Seam Carving with Music Playlists
    Michele Covell
    Proceedings of 2020 International Conference on Advances in Multimedia, IARIA, pp. 24-29
    In both studio and live performances, professional music DJs in an increasing number of popular musical genres mix recordings together into continuous streams that progress seamlessly from one song to the next. When done well, these create an engaging and seamless experience, as if they were part of a single performance. This work introduces a new way to provide that continuity using not only beat matching, but also frequency-dependent cross fades. The basis of our technique is derived from the well-developed technique of visual seam carving, most commonly found in computer vision and graphics systems. We adapt visual seam carving to indicate the times at which to transition specific frequencies from one song to the next. We also describe a way to invert the melded spectrogram with minimal computation. The entire system works faster than real-time, so it can be used in live performances.
    Immediate Gestalt: Shapes, Typography and (Quite Irregular) Packing
    Journal of Mathematics and the Arts, to appear (2020), pp. 1-22
    Instantaneously understanding the gestalt of thousands of words is achieved through the programmatic placement of the words and control of their presentation characteristics, such as size, repetition, and font. As early as the fourteenth century, words were used as building blocks for images. Hundreds of years later, this typographic experiment continues with the addition of raw computational power. The ability to place thousands of words in interesting forms gives rise to a quantitatively different form of expression. The resulting procedures are expressive enough to represent shapes, textures, and shading automatically. Though based on approaches for addressing the classic problem of algorithmic two-dimensional bin-packing, aesthetically pleasing results are achieved through the incorporation of a small set of rules to guide the layout.