Edit and Comment History of Stack Overflow Threads
TL;DR: This blog post explains how to retrieve the edit and comment history of Stack Overflow threads using the SOTorrent dataset.
Information about the most recent database layout and dataset version can be found on the project page.
Creating the Edit History Table
In our MSR 2018 paper, we describe a close relationship between post edits and comments.
To support a further investigation of this relationship, we show how to create a new table EditHistory
, which aggregates all title and body edits of Stack Overflow posts, together with post comments.
Using this table, one can easily retrieve the edit and comment history of individual threads (see below).
The BigQuery versions of the following queries can be found here.
CREATE TABLE EditHistory AS
SELECT *
FROM (
SELECT
pv.PostId as PostId,
pv.PostTypeId as PostTypeId,
PostHistoryId as EventId,
CASE
WHEN pv.PostHistoryTypeId=2 THEN "InitialBody"
ELSE "BodyEdit"
END as Event,
UserId,
pv.CreationDate AS CreationDate
FROM PostVersion pv
JOIN PostHistory ph
ON pv.PostHistoryId = ph.Id
UNION ALL
SELECT
tv.PostId as PostId,
tv.PostTypeId as PostTypeId,
PostHistoryId as EventId,
CASE
WHEN tv.PostHistoryTypeId=1 THEN "InitialTitle"
ELSE "TitleEdit"
END as Event,
UserId,
tv.CreationDate as CreationDate
FROM TitleVersion tv
JOIN PostHistory ph
ON tv.PostHistoryId = ph.Id
UNION ALL
SELECT
PostId,
PostTypeId,
c.Id as EventId,
"Comment" AS Event,
UserId,
c.CreationDate as CreationDate
FROM Comments c
JOIN Posts p
ON c.PostId = p.Id
) AS EditHistory;
ALTER TABLE EditHistory ADD INDEX EditHistoryPostIdIndex (PostId);
ALTER TABLE EditHistory ADD INDEX EditHistoryEventIdIndex (EventId);
Retrieving the History of Individual Threads
We first create a helper table that allows to retrieve the edit history of a thread using the PostId
of either a question or an answer:
CREATE TABLE Threads AS
SELECT
Id as PostId,
PostTypeId,
CASE
WHEN PostTypeId=1 THEN Id
WHEN PostTypeId=2 THEN ParentId
END as ParentId
FROM Posts
# only consider questions and answers
WHERE PostTypeId=1
OR PostTypeId=2;
ALTER TABLE Threads ADD INDEX ThreadsPostIdIndex (PostId);
ALTER TABLE Threads ADD INDEX ThreadsParentIdIndex (ParentId);
Then, we retrieve the edit history of the Stack Overflow thread mentioned in this earlier blog post with the following query:
SELECT *
FROM EditHistory
WHERE PostId IN (
SELECT PostId
FROM Threads
WHERE ParentId = (
SELECT ParentID
FROM Threads
# this is an answer id, the question id 3758606 yields the same result
WHERE PostId=3758880
)
)
ORDER BY CreationDate;
The following figure shows the result of the above query (truncated):
Possible Extensions
To retrieve the edit history on the level of individual text or code blocks, one could either modify the table creation query or join the InitialBody
and BodyEdit
events with table PostBlockVersion
, using the PostHistoryId
of the events.
More Information
First insights into the evolution of Stack Overflow posts, and in particular the connection of edits and comments, can be found in our research paper about SOTorrent. The dataset is available as a database dump (including import scripts) on Zenodo and as a public Google BigQuery dataset. Visit the project page to get the most recent version.