Hamza Harkous

Hamza Harkous is a Staff Research Scientist at Google, Zürich. He currently leads an effort to transform the data curation and model-building process with large language models, driving advances in privacy, safety, security, and beyond across Google's products. He previously architected the machine learning models behind Google's Checks, the privacy compliance service. Prior to his tenure at Google, he worked at Amazon Alexa on natural language understanding and generation. He received his PhD in Computer Science from the Swiss Federal Institute of Technology in Lausanne (EPFL), where he also served as a postdoctoral researcher. During that time, he researched and developed tools for improving users' comprehension of privacy practices and for automatically analyzing privacy policies. You can learn more about his work on his personal homepage.
Authored Publications
    Website Data Transparency in the Browser
    Sebastian Zimmeck
    Daniel Goldelman
    Owen Kaplan
    Logan Brown
    Justin Casler
    Judeley Jean-Charles
    Joe Champeau
    24th Privacy Enhancing Technologies Symposium (PETS 2024) (to appear)
    Abstract: Data collection by websites and their integrated third parties is often not transparent. We design privacy interfaces for the browser to help people understand who is collecting which data from them. In a proof-of-concept browser extension, Privacy Pioneer, we implement a privacy popup, a privacy history interface, and a watchlist to notify people when their data is collected. For detecting location data collection, we develop a machine learning model based on TinyBERT, which reaches an average F1 score of 0.94. We supplement our model with deterministic methods to detect trackers, collection of personal data, and other monetization techniques. In a usability study with 100 participants, 82% found Privacy Pioneer easy to understand and 90% found it useful, indicating the value of privacy interfaces directly integrated into the browser.
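    A rough sketch of the kind of detector described, assuming a generic public TinyBERT checkpoint and an illustrative payload (the extension runs its model in the browser, so this is not the authors' code):

    ```python
    # Illustrative only: a TinyBERT-style binary classifier over request
    # payload text. Real use requires fine-tuning on labeled payloads.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "huawei-noah/TinyBERT_General_4L_312D"  # generic TinyBERT weights
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    payload = '{"lat": 47.3769, "lon": 8.5417, "city": "Zurich"}'
    inputs = tokenizer(payload, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    print(probs)  # [P(no location), P(location)] once the head is trained
    ```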
    Abstract: We present an analysis of 12 million instances of privacy-relevant reviews publicly visible on the Google Play Store, spanning a 10-year period. By leveraging state-of-the-art NLP techniques, we examine what users have been writing about privacy along multiple dimensions: time, countries, app types, diverse privacy topics, and even across a spectrum of emotions. We find consistent growth of privacy-relevant reviews and explore topics that are trending (such as Data Deletion and Data Theft), as well as those on the decline (such as privacy-relevant reviews on sensitive permissions). We find that although privacy reviews come from more than 200 countries, 33 countries provide 90% of them. We compare countries by examining the distribution of privacy topics their users write about, and find that geographic proximity is not a reliable indicator that nearby countries share similar privacy perspectives. We uncover some countries with unique patterns and explore those herein. Surprisingly, we find that it is not uncommon for reviews discussing privacy to be positive (32%); many users express pleasure about privacy features within apps or about privacy-focused apps. We also uncover unexpected behaviors, such as the use of reviews to deliver privacy disclaimers to developers. Finally, we demonstrate the value of analyzing app reviews with our approach as a complement to existing methods for understanding users' perspectives about privacy.
    On the Potential of Mediation Chatbots for Mitigating Multiparty Privacy Conflicts - A Wizard-of-Oz Study
    Kavous Salehzadeh Niksirat
    Diana Korka
    Kévin Huguenin
    Mauro Cherubini
    26th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2023) (to appear)
    Abstract: Sharing multimedia content without obtaining consent from the people involved causes multiparty privacy conflicts (MPCs). However, social-media platforms do not proactively protect users from the occurrence of MPCs. Hence, users resort to out-of-band, informal communication channels in an attempt to mitigate such conflicts. Previous work has focused on hard interventions that do not adequately consider contextual factors (e.g., social norms, cognitive priming) or that are employed too late (i.e., the content has already been seen). In this work, we investigate the potential of conversational agents as a medium for negotiating and mitigating MPCs. We designed MediationBot, a mediator chatbot that encourages consent collection, enables users to explain their points of view, and proposes solutions for finding a middle ground. We evaluated our design in a Wizard-of-Oz experiment with N=32 participants, where we found that MediationBot can effectively help participants reach an agreement and prevent MPCs. It produced structured conversations in which participants had well-clarified speaking turns. Overall, our participants found MediationBot supportive, as it proposes useful middle-ground solutions. Our work informs the future design of mediator agents to support social-media users against MPCs.
    CookieEnforcer: Automated Cookie Notice Analysis and Enforcement
    Rishabh Khandelwal
    Asmit Nayak
    Kassem Fawaz
    32nd USENIX Security Symposium (2023)
    Abstract: Online websites use cookie notices to elicit consent from users, as required by recent privacy regulations like the GDPR and the CCPA. Prior work has shown that these notices are designed to manipulate users into making website-friendly choices that put users' privacy at risk. In this work, we present CookieEnforcer, a new system for automatically discovering cookie notices and extracting a set of instructions that disables all non-essential cookies. To achieve this, we first build an automatic cookie notice detector that utilizes the rendering pattern of HTML elements to identify cookie notices. Next, we analyze the notices and predict the set of actions required to disable all unnecessary cookies. We do this by modeling the problem as a sequence-to-sequence task, where the input is a machine-readable cookie notice and the output is the set of clicks to make. We demonstrate the efficacy of CookieEnforcer via an end-to-end accuracy evaluation, showing that it generates the required steps in 93.7% of cases. Via a user study, we also show that CookieEnforcer significantly reduces user effort. Finally, we characterize the behavior of CookieEnforcer on the top 100k websites from the Tranco list, showcasing its stability and scalability.
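    The sequence-to-sequence framing lends itself to a short sketch. The serialization format and model choice below are our assumptions for illustration, not the paper's published setup:

    ```python
    # Sketch: map a serialized cookie notice to a click sequence with a
    # T5-style encoder-decoder (untrained here; needs fine-tuning on
    # (notice, action-sequence) pairs to produce meaningful output).
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Hypothetical serialization: each interactive element of the
    # rendered notice becomes a numbered item the decoder can reference.
    notice = ("cookie notice: We use cookies for analytics and ads. "
              "elements: [0] button 'Accept all' [1] button 'Settings' "
              "[2] toggle 'Analytics' [3] button 'Save choices'")

    inputs = tokenizer(notice, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=16)
    # A fine-tuned model would emit something like:
    # "click 1 ; toggle 2 off ; click 3"
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    ```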
    Abstract: Integrating user feedback is one of the pillars of building successful products. However, this feedback is generally collected in an unstructured free-text form, which is challenging to understand at scale. This is particularly demanding in the privacy domain due to the nuances associated with the concept and the limited existing solutions. In this work, we present Hark, a system for discovering and summarizing privacy-related feedback at scale. Hark automates the entire process of summarizing privacy feedback, starting from unstructured text and resulting in a hierarchy of high-level privacy themes and fine-grained issues within each theme, along with representative reviews for each issue. At the core of Hark is a set of new deep learning models trained on different tasks, such as privacy feedback classification, privacy issue generation, and high-level theme creation. We illustrate Hark's efficacy on a corpus of 626M Google Play reviews. Out of this corpus, our privacy feedback classifier extracts 6M privacy-related reviews (with an AUC-ROC of 0.92). With three annotation studies, we show that Hark's generated issues are of high accuracy and coverage and that the theme titles are of high quality. We illustrate Hark's capabilities by presenting high-level insights from 1.3M Android apps.
    PriSEC: A Privacy Settings Enforcement Controller
    Rishabh Khandelwal
    Thomas Linden
    Kassem Fawaz
    30th USENIX Security Symposium (2021)
    Abstract: Online privacy settings aim to provide users with control over their data. However, in their current state, they suffer from usability and reachability issues. The recent push towards automatically analyzing privacy notices has not been accompanied by a similar effort for the more critical case of privacy settings. So far, the best efforts have targeted the special case of making opt-out pages more reachable. In this work, we present PriSEC, a Privacy Settings Enforcement Controller that leverages machine learning techniques towards a new paradigm for automatically enforcing web privacy controls. PriSEC goes beyond finding the webpages with privacy settings to discovering fine-grained options, presenting them in a searchable, centralized interface, and, most importantly, enforcing them on demand with minimal user intervention. We overcome the open nature of web development through novel algorithms that leverage the invariant behavior and rendering of webpages. Evaluating PriSEC, we find that it precisely annotates the privacy controls for 94.3% of the control pages in our evaluation set. To demonstrate its usability, we conduct a user study with 148 participants, showing an average 3.75x reduction in the time taken to adjust privacy settings compared to the baseline system.
    Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity
    Isabel Groves
    Amir Saffari
    The 28th International Conference on Computational Linguistics (COLING 2020) (to appear)
    Abstract: End-to-end neural data-to-text (D2T) generation has recently emerged as an alternative to pipeline-based architectures. However, it has faced challenges in generalizing to new domains and generating semantically consistent text. In this work, we present DataTuner, a neural, end-to-end data-to-text generation system that makes minimal assumptions about the data representation and the target domain. We take a two-stage generation-reranking approach, combining a fine-tuned language model with a semantic fidelity classifier. Each of our components is learnt end-to-end without the need for dataset-specific heuristics, entity delexicalization, or post-processing. We show that DataTuner achieves state-of-the-art results on the automated metrics across four major D2T datasets (LDC2017T10, WebNLG, ViGGO, and Cleaned E2E), with fluency, as assessed by human annotators, nearing or exceeding that of the human-written reference texts. We further demonstrate that the model-based semantic fidelity scorer in DataTuner is a better assessment tool than traditional, heuristic-based measures, and our generated text has significantly better semantic fidelity than the state of the art across all four datasets.
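    The two-stage recipe is easy to sketch. In the stand-in below, GPT-2 substitutes for the fine-tuned language model and a verbatim slot-match heuristic substitutes for the learned semantic fidelity classifier (both are our assumptions, not the DataTuner components):

    ```python
    # Generate-then-rerank: sample several candidate verbalizations,
    # keep the one scored most faithful to the input data.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # stand-in LM

    def fidelity_score(values, text):
        # Stand-in scorer: fraction of slot values realized verbatim.
        return sum(v.lower() in text.lower() for v in values) / len(values)

    values = ["Aromi", "Italian", "city centre"]
    prompt = "Data: name = Aromi | food = Italian | area = city centre\nText:"
    outs = generator(prompt, num_return_sequences=4, do_sample=True,
                     max_new_tokens=30, pad_token_id=50256)
    texts = [o["generated_text"][len(prompt):] for o in outs]
    print(max(texts, key=lambda t: fidelity_score(values, t)))
    ```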
    The Applications of Machine Learning in Privacy Notice and Choice
    Kassem Fawaz
    Thomas Linden
    11th International Conference on Communication Systems & Networks (COMSNETS 2019)
    Abstract: For more than two decades since the rise of the World Wide Web, the “Notice and Choice” framework has been the governing practice for the disclosure of online privacy practices. The emergence of new forms of user interaction, such as voice, and the enforcement of new regulations, such as the EU's General Data Protection Regulation (GDPR), promise to change this privacy landscape drastically. This paper discusses the challenges in providing privacy stakeholders with privacy awareness and control in this changing landscape. We also present our recent research on utilizing machine learning to analyze privacy policies and settings.
    The Privacy Policy Landscape after the GDPR
    Thomas Linden
    Rishabh Khandelwal
    Kassem Fawaz
    Proceedings on Privacy Enhancing Technologies (2019)
    Abstract: The EU General Data Protection Regulation (GDPR) is one of the most demanding and comprehensive privacy regulations of all time. A year after it went into effect, we study its impact on the landscape of privacy policies online. We conduct the first longitudinal, in-depth, and at-scale assessment of privacy policies before and after the GDPR. We gauge the complete consumption cycle of these policies, from the first user impressions up to the compliance assessment. We create a diverse corpus of two sets of 6,278 unique English-language privacy policies from inside and outside the EU, covering their pre-GDPR and post-GDPR versions. The results of our tests and analyses suggest that the GDPR has been a catalyst for a major overhaul of the privacy policies inside and outside the EU. This overhaul, manifesting in extensive textual changes, especially for EU-based websites, brings mixed benefits to users. While privacy policies have become considerably longer, our user study with 470 participants on Amazon MTurk indicates a significant improvement in the visual representation of privacy policies of EU websites from the users' perspective. We further develop a new workflow for the automated assessment of requirements in privacy policies. Using this workflow, we show that privacy policies cover more data practices and are more consistent with seven compliance requirements after the GDPR. We also assess how transparent organizations are with their privacy practices by performing specificity analysis. In this analysis, we find evidence of positive changes triggered by the GDPR, with the specificity level improving on average. Still, we find the landscape of privacy policies to be in a transitional phase; many policies still do not meet several key GDPR requirements, or their improved coverage comes with reduced specificity.
    280 Birds with One Stone: Inducing Multilingual Taxonomies from Wikipedia using Character-level Classification
    Amit Gupta
    Rémi Lebret
    Karl Aberer
    32nd AAAI Conference on Artificial Intelligence (2018)
    Abstract: We propose a novel, fully automated approach for inducing multilingual taxonomies from Wikipedia. Given an English taxonomy, our approach first leverages the interlanguage links of Wikipedia to automatically construct training datasets for the is-a relation in the target language. Character-level classifiers are trained on the constructed datasets and used in an optimal path discovery framework to induce high-precision, high-coverage taxonomies in other languages. Through experiments, we demonstrate that our approach significantly outperforms the state-of-the-art, heuristics-heavy approaches for six languages. As a consequence of this work, we release what is, to our knowledge, the largest and most accurate multilingual taxonomic resource, spanning 280 languages.
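    To make the character-level idea concrete, here is a toy stand-in (our illustration, with made-up German training pairs of the kind that interlanguage links could yield): character n-grams capture morphological cues such as suffixes, which is what lets such classifiers transfer across languages.

    ```python
    # Toy is-a classifier over "child [SEP] candidate parent" strings
    # using character n-gram features (illustrative data and labels).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    pairs = [
        ("Biologie [SEP] Wissenschaft", 1),  # biology is-a science
        ("Berlin [SEP] Stadt", 1),           # Berlin is-a city
        ("Stadt [SEP] Berlin", 0),           # reversed: not is-a
        ("Wissenschaft [SEP] Biologie", 0),
    ]
    X = [text for text, _ in pairs]
    y = [label for _, label in pairs]

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(),
    )
    clf.fit(X, y)
    print(clf.predict(["Chemie [SEP] Wissenschaft"]))
    ```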
    Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning
    Kassem Fawaz
    Rémi Lebret
    Florian Schaub
    Kang G. Shin
    Karl Aberer
    27th USENIX Security Symposium (USENIX Security 2018)
    Abstract: Privacy policies are the primary channel through which companies inform users about their data collection and sharing practices. These policies are often long and difficult to comprehend. Short notices based on information extracted from privacy policies have been shown to be useful but face a significant scalability hurdle, given the number of policies and their evolution over time. Companies, users, researchers, and regulators still lack usable and scalable tools to cope with the breadth and depth of privacy policies. To address these hurdles, we propose Polisis, an automated framework for privacy policy analysis. It enables scalable, dynamic, and multi-dimensional queries on natural-language privacy policies. At the core of Polisis is a privacy-centric language model, built from 130K privacy policies, and a novel hierarchy of neural-network classifiers that accounts for both high-level aspects and fine-grained details of privacy practices. We demonstrate Polisis' modularity and utility with two applications supporting structured and free-form querying. The structured querying application is the automated assignment of privacy icons from privacy policies, on which Polisis achieves an accuracy of 88.4%. The second application, PriBot, is the first free-form question-answering system for privacy policies. We show that PriBot can produce a correct answer among its top-3 results for 82% of the test questions. Using an MTurk user study with 700 participants, we show that at least one of PriBot's top-3 answers is relevant to users for 89% of the test questions.
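    The hierarchy-of-classifiers idea can be sketched with off-the-shelf tools; Polisis itself uses embeddings trained on the 130K policies and neural classifiers, so the toy below (with hypothetical labels and segments) only illustrates the routing from high-level categories to fine-grained attributes:

    ```python
    # Two-level classification: a top-level model picks the high-level
    # privacy category; a per-category model picks the fine attribute.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    segments = [
        "We share your email address with advertising partners.",
        "You may delete your account at any time.",
        "We collect your location to personalize results.",
        "Contact support to access or remove your data.",
    ]
    categories = ["third-party-sharing", "user-control",
                  "data-collection", "user-control"]

    top = make_pipeline(TfidfVectorizer(), LogisticRegression())
    top.fit(segments, categories)

    # One fine-grained classifier per category (only one shown here).
    fine = make_pipeline(TfidfVectorizer(), LogisticRegression())
    fine.fit([segments[1], segments[3]], ["deletion", "access"])

    seg = "Users can erase their profile from the settings page."
    cat = top.predict([seg])[0]
    print("high-level:", cat)
    if cat == "user-control":
        print("fine-grained:", fine.predict([seg])[0])
    ```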
    "If You Can't Beat them, Join them": A Usability Approach to Interdependent Privacy in Cloud Apps
    Karl Aberer
    Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy (CODASPY '17) (2017)
    Abstract: Cloud storage services like Dropbox and Google Drive have growing ecosystems of third-party apps that are designed to work with users' cloud files. Such apps often request full access to users' files, including files shared with collaborators. Hence, whenever a user grants access to a new vendor, she inflicts a privacy loss on herself and on her collaborators. By analyzing a real dataset of 183 Google Drive users and 131 third-party apps, we discover that collaborators inflict a privacy loss that is at least 39% higher than what users themselves cause. We take a step toward minimizing this loss by introducing the concept of History-based decisions. Simply put, users are informed at decision time about the vendors that have previously been granted access to their data. Thus, they can reduce their privacy loss by not installing apps from new vendors whenever possible. Next, we realize this concept by introducing a new privacy indicator, which can be integrated within the cloud apps' authorization interface. Via a web experiment with 141 participants recruited from CrowdFlower, we show that our privacy indicator can significantly increase the user's likelihood of choosing the app that minimizes her privacy loss. Finally, we explore the network effect of History-based decisions via a simulation on top of large collaboration networks. We demonstrate that adopting such a decision-making process can reduce the growth of user privacy loss by 40% in a Google Drive-based network and by 70% in an author collaboration network, despite the fact that we assume neither that users cooperate nor that they exhibit altruistic behavior. To our knowledge, our work is the first to provide quantifiable evidence of the privacy risk that collaborators pose in cloud apps and the first to mitigate this problem via a usable privacy approach.
    Taxonomy induction using hypernym subsequences
    Amit Gupta
    Rémi Lebret
    Karl Aberer
    CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
    Abstract: We propose a novel, semi-supervised approach to domain taxonomy induction from an input vocabulary of seed terms. Unlike all previous approaches, which typically extract direct hypernym edges for terms, our approach utilizes a novel probabilistic framework to extract hypernym subsequences. Taxonomy induction from the extracted subsequences is cast as an instance of the minimum-cost flow problem on a carefully designed directed graph. Through experiments, we demonstrate that our approach outperforms state-of-the-art taxonomy induction approaches across four languages. Importantly, we also show that our approach is robust to the presence of noise in the input vocabulary. To the best of our knowledge, this robustness has not been empirically demonstrated in any previous approach.
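    To illustrate the minimum-cost flow casting on a toy graph of our own (not the paper's exact construction): give each candidate is-a edge a cost inversely related to its confidence and route one unit of flow from a term to the taxonomy root.

    ```python
    # Minimum-cost flow over candidate hypernym edges with networkx.
    import networkx as nx

    G = nx.DiGraph()
    # (child, parent, cost): hypothetical confidence-derived costs.
    for child, parent, cost in [
        ("apple", "fruit", 1), ("apple", "company", 8),
        ("fruit", "food", 2), ("company", "organization", 3),
        ("food", "entity", 1), ("organization", "entity", 1),
    ]:
        G.add_edge(child, parent, weight=cost, capacity=1)

    # One unit of flow from the term to the root selects the cheapest
    # hypernym chain.
    G.nodes["apple"]["demand"] = -1
    G.nodes["entity"]["demand"] = 1

    flow = nx.min_cost_flow(G)
    chosen = [(u, v) for u, nbrs in flow.items()
              for v, f in nbrs.items() if f > 0]
    print(chosen)  # apple -> fruit -> food -> entity
    ```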
    Data-Driven Privacy Indicators
    Rameez Rahman
    Karl Aberer
    Workshop on Privacy Indicators, at the Twelfth Symposium on Usable Privacy and Security (SOUPS 2016)
    Abstract: Third-party applications work on top of existing platforms that host users' data. Although these apps access this data to provide users with specific services, they can also use it for monetization or profiling purposes. In practice, there is a significant gap between users' privacy expectations and the actual access levels of third-party apps, which are often over-privileged. Due to weaknesses in existing privacy indicators, users are generally not well informed about what data these apps get. Moreover, we are witnessing the rise of inverse privacy: third parties collect data that enables them to know information about users that users themselves do not know, cannot remember, or cannot reach. In this paper, we describe our recent experiences with the design and evaluation of Data-Driven Privacy Indicators (DDPIs), an approach that attempts to reduce the aforementioned privacy gap. DDPIs are realized by having a trusted party (e.g., the app platform) analyze the user's data and integrate the analysis results into the privacy indicator's interface. We discuss DDPIs in the context of third-party apps on cloud platforms, such as Google Drive and Dropbox. Specifically, we present our recent work on Far-reaching Insights, which show users the insights that apps can infer about them (e.g., their topics of interest and their collaboration and activity patterns). We then present History-based Insights, a novel privacy indicator that informs the user about what data is already accessible to an app vendor, based on previous app installations by the user or her collaborators. We further discuss ideas for new DDPIs and outline the challenges facing the wide-scale deployment of such indicators.
    The Curious Case of the PDF Converter that Likes Mozart: Dissecting and Mitigating the Privacy Risk of Personal Cloud Apps
    Rameez Rahman
    Bojan Karlas
    Karl Aberer
    16th Privacy Enhancing Technologies Symposium (PETS 2016)
    Abstract: Third-party apps that work on top of personal cloud services such as Google Drive and Dropbox require access to the user's data in order to provide their functionality. Through a detailed analysis of a hundred popular Google Drive apps from Google's Chrome store, we discover that the existing permission model is quite often misused: around two thirds of the analyzed apps are over-privileged, i.e., they access more data than they need to function. In this work, we analyze three different permission models that aim to discourage users from installing over-privileged apps. In experiments with 210 real users, we discover that the most successful permission model is our novel ensemble method, which we call Far-reaching Insights. Far-reaching Insights inform users about the data-driven insights that apps can make about them (e.g., their topics of interest and their collaboration and activity patterns), thus seeking to bridge the gap between what third parties can actually know about users and users' perception of their privacy leakage. The efficacy of Far-reaching Insights in bridging this gap is demonstrated by our results: they prove to be, on average, twice as effective as the current model in discouraging users from installing over-privileged apps. To promote general privacy awareness, we deploy a publicly available privacy-oriented app store that uses Far-reaching Insights. Based on the knowledge extracted from the data of the store's users (over 115 gigabytes of Google Drive data from 1,440 users with 662 installed apps), we also delineate the ecosystem of third-party cloud apps from the standpoint of developers and cloud providers. Finally, we present several general recommendations that can guide future work on privacy for the cloud.
    PriBots: Conversational Privacy with Chatbots
    Kassem Fawaz
    Kang G. Shin
    Karl Aberer
    Workshop on the Future of Privacy Notices and Indicators, at the Twelfth Symposium on Usable Privacy and Security (SOUPS 2016)
    Abstract: Traditional mechanisms for delivering notice and enabling choice have so far failed to protect users' privacy. Users are continuously frustrated by complex privacy policies, unreachable privacy settings, and a multitude of emerging standards. The miniaturization trend of smart devices and the emergence of the Internet of Things (IoT) will exacerbate this problem further. In this paper, we propose Conversational Privacy Bots (PriBots) as a new way of delivering notice and choice through a two-way dialogue between the user and a computer agent (a chatbot). PriBots improve on the state of the art by offering users a more intuitive and natural interface for inquiring about their privacy settings, thus allowing them to control their privacy. In addition to presenting the potential applications of PriBots, we describe the underlying system needed to support their functionality. We also delve into the challenges associated with delivering privacy as an automated service. PriBots have the potential to enable the use of chatbots in other fields where users need to be informed or put in control.
    C3P: Context-Aware Crowdsourced Cloud Privacy
    Rameez Rahman
    Karl Aberer
    14th Privacy Enhancing Technologies Symposium (PETS 2014)
    Abstract: Due to the abundance of attractive services available on the cloud, people are placing an increasing amount of their data online on different cloud platforms. However, given the recent large-scale attacks on user data, privacy has become an important issue. Ordinary users cannot be expected to manually specify which of their data is sensitive or to take appropriate measures to protect such data. Furthermore, most people are usually not aware of the privacy risk that different shared data items can pose. In this paper, we present a novel conceptual framework in which privacy risk is automatically calculated using the sharing context of data items. To overcome users' ignorance of privacy risk, we use a crowdsourcing-based approach. We apply Item Response Theory (IRT) on top of this crowdsourced data to determine the privacy risk of items and the diverse attitudes of users towards privacy. First, we determine the feasibility of IRT for the cloud scenario by asking workers on Amazon MTurk for feedback on various sharing scenarios. We obtain a good fit of the responses with the theory, and thus show that IRT, a well-known psychometric model for educational purposes, can be applied to the cloud scenario. Then, we present a lightweight mechanism whereby users can crowdsource their sharing contexts with the server and anonymously obtain the risk of sharing particular data items. Finally, we use the Enron dataset to simulate our conceptual framework and also provide experimental results using synthetic data. We show that our scheme converges quickly and provides accurate privacy risk scores under varying conditions.
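    For intuition, the two-parameter logistic (2PL) model underlying IRT fits in a few lines; below, item "difficulty" is read as sensitivity and respondent "ability" as sharing permissiveness, with made-up parameter values for illustration:

    ```python
    # 2PL response curve: P(user shares item) given the user's
    # permissiveness theta, item discrimination a, and sensitivity b.
    import numpy as np

    def p_share(theta, a, b):
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    items = {"vacation photo": (1.2, -1.0),   # (a, b), hypothetical
             "salary slip": (2.0, 1.5)}
    for name, (a, b) in items.items():
        print(name, [round(p_share(t, a, b), 2) for t in (-2.0, 0.0, 2.0)])
    # Fitted b values rank items by privacy risk; fitted thetas
    # characterize users' privacy attitudes.
    ```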
    Scalable and Secure Polling in Dynamic Distributed Networks
    Sébastien Gambs
    Rachid Guerraoui
    Florian Huc
    Anne-Marie Kermarrec
    31st IEEE International Symposium on Reliable Distributed Systems (2012)
    Abstract: We consider the problem of securely conducting a poll in synchronous dynamic networks equipped with a Public Key Infrastructure (PKI). Whereas previous distributed solutions had a communication cost of O(n²) in an n-node system, we present SPP (Secure and Private Polling), the first distributed polling protocol requiring only a communication complexity of O(n log³ n), which we prove is near-optimal. Our protocol ensures perfect security against a computationally bounded adversary, tolerates (1/2 − ε)n Byzantine nodes for any constant ε with 1/2 > ε > 0 (not depending on n), and outputs the exact value of the poll with high probability. SPP is composed of two sub-protocols, which we believe to be interesting in their own right: SPP-Overlay maintains a structured overlay when nodes leave or join the network, and SPP-Computation conducts the actual poll. We validate the practicality of our approach through experimental evaluations and briefly describe two possible applications of SPP: (1) an optimal Byzantine Agreement protocol whose communication complexity is Θ(n log n), and (2) a protocol solving an open question of King and Saia in the context of aggregation functions, namely on the feasibility of performing multiparty secure aggregations with a communication complexity of o(n²).