The SOTorrent Dataset

Like other software artifacts, questions and answers on Stack Overflow evolve over time, for example when bugs in code snippets are fixed or text surrounding a code snippet is edited for clarity. To analyze how Stack Overflow posts evolve, we built SOTorrent, an open dataset based on the official Stack Overflow data dump. SOTorrent provides access to the version history of Stack Overflow content at the level of whole posts and of individual text or code blocks. It connects Stack Overflow posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to Stack Overflow posts. More information about SOTorrent and the results of a first analysis using the dataset can be found in this blog post and in the corresponding research paper. The source code of the tools used to build and analyze SOTorrent is available on GitHub.

Dataset versions

The changelog for each dataset version is available here.

2018-06-17

2018-05-04

2018-03-28

2018-02-16

2017-12-24

Database layout

SOTorrent contains all tables from the official Stack Overflow data dump. Additionally, the dataset provides the following tables:

PostVersion: Version history on the level of whole Stack Overflow posts.
PostBlockVersion: Version history on the level of post blocks, which are either text (1) or code (2) blocks.
PostBlockDiff: Line-based difference between post block versions.
TitleVersion: Version history of question titles.
PostVersionUrl: URLs extracted from text block versions.
CommentUrl: URLs extracted from post comments.
PostReferenceGH: References to Stack Overflow posts extracted from files in GitHub repositories.
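
To illustrate what a line-based difference between two versions of a post block looks like, here is a minimal sketch using Python's standard difflib module. It is not the tooling used to build SOTorrent, and the function name line_diff and the operation labels are hypothetical choices made for this example only.

```python
import difflib


def line_diff(old, new):
    """Return (operation, line) pairs for a line-based diff of two block versions.

    Illustrative helper only; not part of the SOTorrent tooling.
    """
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(("equal", line) for line in old_lines[i1:i2])
        else:
            # "replace" contributes deletions and insertions;
            # "delete" and "insert" contribute only one side.
            result.extend(("delete", line) for line in old_lines[i1:i2])
            result.extend(("insert", line) for line in new_lines[j1:j2])
    return result


# Two hypothetical versions of the same code block.
v1 = "int x = 0;\nprint(x)"
v2 = "int x = 1;\nprint(x)"
diff = line_diff(v1, v2)
for operation, line in diff:
    print(operation, line)
```

Each pair states whether a line was kept, removed, or added between the two versions, which is the kind of information a line-based diff table can store per pair of consecutive block versions.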

In the following database schema, the tables from the official Stack Overflow data dump are marked gray and the additional tables are marked blue. Please note that not all foreign key constraints are shown.

SOTorrent database schema (2018-05-04)
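
As a rough sketch of how the block-level version history could be queried, the following example builds a tiny in-memory SQLite table and selects all code block versions of one post. The column names (PostId, PostVersionId, PostBlockTypeId, Content) and the sample rows are assumptions made for illustration; only the table name and the type codes 1 (text) and 2 (code) come from the description above, so check the actual schema before adapting this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, hypothetical stand-in for the PostBlockVersion table.
conn.execute(
    """
    CREATE TABLE PostBlockVersion (
        Id INTEGER PRIMARY KEY,
        PostId INTEGER,
        PostVersionId INTEGER,
        PostBlockTypeId INTEGER,  -- 1 = text block, 2 = code block
        Content TEXT
    )
    """
)

# Made-up sample data: one post with two versions, each with a text and a code block.
rows = [
    (1, 1, 10, 1, "Initialize the variable like this:"),
    (2, 1, 10, 2, "x = 0"),
    (3, 1, 11, 1, "Initialize the variable like this:"),
    (4, 1, 11, 2, "x = 1"),
]
conn.executemany("INSERT INTO PostBlockVersion VALUES (?, ?, ?, ?, ?)", rows)


def code_block_versions(connection, post_id):
    """Return (PostVersionId, Content) for all code block versions of a post."""
    cursor = connection.execute(
        "SELECT PostVersionId, Content FROM PostBlockVersion "
        "WHERE PostId = ? AND PostBlockTypeId = 2 "
        "ORDER BY PostVersionId",
        (post_id,),
    )
    return cursor.fetchall()


for version_id, content in code_block_versions(conn, 1):
    print(version_id, content)
```

The same filter with PostBlockTypeId = 1 would return the text blocks instead.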

Licenses

The following tables are identical to the corresponding XML files in the official Stack Exchange data dump, which is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0):

Badges, Comments, PostHistory, PostLinks, Posts, Tags, Users, Votes, PostType

The table PostReferenceGH was retrieved from the Google BigQuery GitHub data set, for which the GitHub Terms of Service apply.

The following tables are based on the tables from the official Stack Exchange data dump listed above. We license them under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0):

CommentUrl, PostBlockDiff, PostBlockVersion, PostVersion, PostVersionUrl, TitleVersion

Publication

SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts.
Sebastian Baltes, Lorik Dumani, Christoph Treude, and Stephan Diehl.
Proceedings of the 15th International Conference on Mining Software Repositories (MSR 2018).
Acceptance rate: 33% (37/113).
Invited to an EMSE journal special issue.
Preprint arXiv Slides