Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset

We are releasing the Princeton-Leuven Longitudinal Corpus of Privacy Policies, a reference dataset of over 1 million privacy policy snapshots from more than 100,000 websites, spanning over two decades.

Background

Automated analysis of privacy policies has proved useful for research, but until now there has been no large-scale longitudinal dataset for studying how privacy policies have changed over time.

To address this gap, we are releasing a dataset of over 1 million privacy policies collected from the Internet Archive’s Wayback Machine. To build this dataset, we developed a custom crawler that detects and downloads privacy policies from archived web pages. We processed the downloaded policies to clean up error pages, extract the text of the privacy policies, and filter out non-policy documents using machine learning.
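As a rough illustration of the detection step, the sketch below scans a page's HTML for links whose anchor text mentions privacy. It is not the crawler used to build this dataset: the keyword list, the `PolicyLinkFinder` class, and the `find_policy_links` helper are hypothetical names chosen for this example, and a real crawler would need many more heuristics.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list; the actual detection rules are not described here.
POLICY_KEYWORDS = ("privacy", "data protection")


class PolicyLinkFinder(HTMLParser):
    """Collect hrefs of <a> elements whose visible text mentions a keyword."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace and case before matching keywords.
            label = " ".join("".join(self._text).split()).lower()
            if any(keyword in label for keyword in POLICY_KEYWORDS):
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def find_policy_links(html, base_url):
    """Return absolute URLs of links that look like privacy policy links."""
    finder = PolicyLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.links
```

Calling `find_policy_links(page_html, page_url)` on an archived homepage would return candidate policy URLs to download. Deciding whether a downloaded document is actually a privacy policy is a separate filtering step.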

Overview of the data

The dataset contains 1,071,488 English-language privacy policy snapshots from 130,620 distinct websites chosen from the Alexa Top 100K between 2009 and 2019. In addition to the sanitized privacy policy text and the raw webpage HTML, the dataset includes metadata such as the archival time and the URL of the website that each policy belongs to. Although the dataset contains policies from as early as the late 1990s, more than 90% of the policies are from 2007 or later.
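To make the shape of the data concrete, here is a minimal sketch of how one snapshot and its metadata might be represented, and how a statistic such as the share of policies from a given year onward could be computed. The `Snapshot` record and its field names are assumptions made for this example; the released files may be organized differently.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical record layout; the released files may use different names.
    site_url: str          # website the policy belongs to
    archived_at: datetime  # archival time of the snapshot
    policy_text: str       # sanitized policy text


def share_from_year_onward(snapshots, year):
    """Fraction of snapshots archived in `year` or later (0.0 for empty input)."""
    snapshots = list(snapshots)
    if not snapshots:
        return 0.0
    recent = sum(1 for s in snapshots if s.archived_at.year >= year)
    return recent / len(snapshots)


def snapshots_per_site(snapshots):
    """Count how many snapshots each website contributes."""
    return Counter(s.site_url for s in snapshots)
```

With the records loaded, `share_from_year_onward(snapshots, 2007)` would give the kind of figure quoted above, and `snapshots_per_site(snapshots)` shows how unevenly websites are covered.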

Access

To get access to the data, please send an email to privacy-policy-data@lists.cs.princeton.edu stating your name and affiliation.

Our dataset is also available as a GitHub repository. You can use the web frontend to browse the archived policies.

Cite us

If you use our dataset, please cite us:

@inproceedings{…,
  title = {Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset},
  booktitle = {… 2021},
  author = {…},
  date = {…},
  pages = {…},
  publisher = {…},
  location = {…},
  doi = {…},
  url = {…},
  series = {… '21}
}

Contact

Ryan Amos rbamos@cs.princeton.edu
Gunes Acar gunes.acar@esat.kuleuven.be
Eli Lucherini elucherini@cs.princeton.edu
Mihir Kshirsagar mihir@princeton.edu
Jonathan Mayer jonathan.mayer@princeton.edu
Arvind Narayanan arvindn@cs.princeton.edu