Lauro Beltrão Costa
I finished my PhD in Electrical and Computer Engineering at the University of British Columbia and joined Google as a Software Engineer. At Google, I worked in Cloud AI on the Natural Language/Document AI team; before that, I worked with the Google AdX and DBM teams.
Publications: https://scholar.google.com/citations?user=v2shjncAAAAJ
Past:
At UBC, my advisor was Matei Ripeanu, and I worked in NetSysLab. My research focused on automating the configuration of intermediate storage systems. I was also involved in the Totem project (graph processing on hybrid architectures).
During the summers of 2009, 2010, and 2013, I had a great time interning at Google in Mountain View, CA, USA: with the Site Reliability Engineering team for ContentAds (2009), and as a Software Engineering intern with the Cluster Management team (2010) and the Google Feedback team (2013).
Before starting at UBC, I was at the LSD (Distributed Systems Lab) of UFCG (Universidade Federal de Campina Grande) in Brazil, where I worked as a software engineer and assistant researcher (2006-2008) and concluded my BSc (2003) and MSc (2005) studies contributing to the OurGrid project. As part of my research on OurGrid, I undertook an internship at Hewlett Packard Labs in Palo Alto, CA, USA. In 2008, I also spent three months as an intern at Fraunhofer ITWM in Kaiserslautern, Germany, contributing to their Jawari project.
Authored Publications
Glean: Structured Extractions from Templatic Documents
Proceedings of the VLDB Endowment (2021), pp. 997-1005
Abstract
Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely many ways. A good solution to this problem is one that generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones.
We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
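To make the setup the abstract describes concrete (a target schema for a document type, plus labeled documents of that type), here is a minimal, hypothetical Python sketch. The class and field names (TargetSchema, Field, Annotation, LabeledDocument) are illustrative assumptions for this sketch only, not Glean's actual data model or API.

# Hypothetical sketch of a target schema and a labeled training example for
# an invoice-like document type. Illustrative only; not Glean's actual API.
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str          # e.g. "invoice_date"
    field_type: str    # e.g. "date", "currency_amount", "text"
    required: bool = False


@dataclass
class TargetSchema:
    doc_type: str
    fields: list[Field] = field(default_factory=list)


@dataclass
class Annotation:
    field_name: str    # which schema field this span is labeled as
    text_span: str     # raw text of the labeled span in the document
    page: int          # page on which the span occurs


@dataclass
class LabeledDocument:
    doc_id: str
    annotations: list[Annotation] = field(default_factory=list)


# A schema for invoices: the model should extract these fields from any
# invoice layout, including templates never seen in training.
invoice_schema = TargetSchema(
    doc_type="invoice",
    fields=[
        Field("invoice_date", "date", required=True),
        Field("total_amount", "currency_amount", required=True),
        Field("supplier_name", "text"),
    ],
)

# One labeled document: ground-truth spans for schema fields, the kind of
# data the paper describes turning into training data for the model.
example = LabeledDocument(
    doc_id="inv-0001",
    annotations=[
        Annotation("invoice_date", "Jan 5, 2021", page=1),
        Annotation("total_amount", "$1,234.00", page=1),
    ],
)

The point of the sketch is that the schema, not any particular template, defines what to extract, which is why the same labeled-data pipeline can serve documents from layouts never seen in training.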