Google Research

Let's Parse to Prevent Pwnage

  • Mike Samuel
  • Úlfar Erlingsson
USENIX workshop on Large-Scale Exploits and Emergent Threats, USENIX (2012)


Software that processes rich content suffers from endemic security vulnerabilities. Frequently, these bugs are due to data confusion: discrepancies in how content data is parsed, composed, and otherwise processed by different applications, frameworks, and language runtimes. Data confusion often enables code injection attacks, such as cross-site scripting or SQL injection, by leading to incorrect assumptions about the encodings and checks applied to rich content of uncertain provenance. However, even for well-structured, value-only content, data confusion can critically impact security, e.g., as shown by XML signature vulnerabilities [12].

This paper advocates the position that data confusion can be effectively prevented through the use of simple mechanisms—based on parsing—that eliminate ambiguities by fully resolving content data to normalized, clearly-understood forms.

Using code injection on the Web as our motivation, we make the case that automatic defense mechanisms should be integrated with programming languages, application frameworks, and runtime libraries, and applied with little, or no, developer intervention. We outline a scalable, sustainable approach for developing and maintaining those mechanisms. The resulting tools can offer comprehensive protection against data confusion, even when multiple types of rich content data are processed and composed in complex ways.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work