TL;DR: This blog post explains how to retrieve the edit and comment history of Stack Overflow threads using the SOTorrent dataset.
Information about the most recent database layout and dataset version can be found on the project page.
Creating the Edit History Table
In our MSR 2018 paper, we describe a close relationship between post edits and comments.
To support a further investigation of this relationship, we show how to create a new table
EditHistory, which aggregates all title and body edits of Stack Overflow posts, together with post comments.
Using this table, one can easily retrieve the edit and comment history of individual threads (see below).
The BigQuery versions of the following queries can be found here.
CREATE TABLE EditHistory AS SELECT * FROM ( SELECT pv.PostId as PostId, pv.PostTypeId as PostTypeId, PostHistoryId as EventId, CASE WHEN pv.PostHistoryTypeId=2 THEN "InitialBody" ELSE "BodyEdit" END as Event, UserId, pv.CreationDate AS CreationDate FROM PostVersion pv JOIN PostHistory ph ON pv.PostHistoryId = ph.Id UNION ALL SELECT tv.PostId as PostId, tv.PostTypeId as PostTypeId, PostHistoryId as EventId, CASE WHEN tv.PostHistoryTypeId=1 THEN "InitialTitle" ELSE "TitleEdit" END as Event, UserId, tv.CreationDate as CreationDate FROM TitleVersion tv JOIN PostHistory ph ON tv.PostHistoryId = ph.Id UNION ALL SELECT PostId, PostTypeId, c.Id as EventId, "Comment" AS Event, UserId, c.CreationDate as CreationDate FROM Comments c JOIN Posts p ON c.PostId = p.Id ) AS EditHistory; ALTER TABLE EditHistory ADD INDEX EditHistoryPostIdIndex (PostId); ALTER TABLE EditHistory ADD INDEX EditHistoryEventIdIndex (EventId);
Retrieving the History of Individual Threads
We first create a helper table that allows to retrieve the edit history of a thread using the
PostId of either a question or an answer:
CREATE TABLE Threads AS SELECT Id as PostId, PostTypeId, CASE WHEN PostTypeId=1 THEN Id WHEN PostTypeId=2 THEN ParentId END as ParentId FROM Posts # only consider questions and answers WHERE PostTypeId=1 OR PostTypeId=2; ALTER TABLE Threads ADD INDEX ThreadsPostIdIndex (PostId); ALTER TABLE Threads ADD INDEX ThreadsParentIdIndex (ParentId);
Then, we retrieve the edit history of the Stack Overflow thread mentioned in this earlier blog post with the following query:
SELECT * FROM EditHistory WHERE PostId IN ( SELECT PostId FROM Threads WHERE ParentId = ( SELECT ParentID FROM Threads # this is an answer id, the question id 3758606 yields the same result WHERE PostId=3758880 ) ) ORDER BY CreationDate;
The following figure shows the result of the above query (truncated):
To retrieve the edit history on the level of individual text or code blocks, one could either modify the table creation query or join the
BodyEdit events with table
PostBlockVersion, using the
PostHistoryId of the events.
First insights into the evolution of Stack Overflow posts, and in particular the connection of edits and comments, can be found in our research paper about SOTorrent. The dataset is available as a database dump (including import scripts) on Zenodo and as a public Google BigQuery dataset. Visit the project page to get the most recent version.