Sampling Papers Using DBLP


In our ESEM 2016 paper, we reported on methodological and ethical issues when sampling software developers. This time, I want to approach the topic of sampling from a different perspective. For a research project, we wanted to draw a random sample from all full papers published in certain software engineering journals and conference proceedings in a certain time frame. To be able to draw such a sample of papers, we needed a corresponding sampling frame. Instead of manually going through each conference proceedings volume and journal issue, we decided to use the curated data that DBLP provides.

We first implemented an approach using DBLP’s API, but due to a bug that affected the retrieval for certain conference identifiers (e.g. ICSE 2015), we switched to a solution that uses DBLP’s HTML pages. We implemented a web scraper that takes a list of venue identifiers as input and then retrieves all papers published in those venues, including the following metadata:

paper title
authors
heading (corresponds to journal issue or conference session name)
page range
paper length
link to electronic edition of paper

The CSV file that the tool creates has the following structure:

venue	year	identifier	heading	title	authors	pages	length	electronic_edition
ICSE	2014	conf/icse/icse2014	Perspectives on Software Engineering	Cowboys, ankle sprains, and keepers of quality: how is video game development different from software development?	Emerson R. Murphy-Hill; Thomas Zimmermann; Nachiappan Nagappan	1-11	11	https://doi.org/10.1145/2568225.2568226
…	…	…	…	…	…	…	…	…
ICSE	2014	conf/icse/icse2014	Modeling	TradeMaker: automated dynamic analysis of synthesized tradespaces.	Hamid Bagheri; Chong Tang; Kevin J. Sullivan	106-116	11	https://doi.org/10.1145/2568225.2568291
…	…	…	…	…	…	…	…	…
TOSEM	2018	journals/tosem/tosem27	Volume 27, Number 4, November 2018	Variability-Aware Static Analysis at Scale: An Empirical Study.	Alexander von Rhein; Jörg Liebig; Andreas Janker; Christian Kästner; Sven Apel	18:1-18:33	33	https://dl.acm.org/citation.cfm?id=3280986

The property heading is particularly relevant, because it allows us to later exclude papers published, for example, in NIER or tool demo tracks. Moreover, the property paper length, which is derived from the page range, enables us to remove, for example, keynote descriptions or extended abstracts of journal-first papers. Automatically filtering papers based on their length is difficult, because there exist relatively short journal papers (example), but also relatively long keynote papers (example).
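To illustrate how a paper length can be derived from a page range, here is a minimal Python sketch. It is not the code of our tool: the function name `paper_length` and the regular expression are assumptions made for this example, and it only covers the two formats shown in the table above (plain ranges such as 1-11 and article-number ranges such as 18:1-18:33).

```python
import re
from typing import Optional

# Optional article-number prefix ("18:"), first page, optional last page.
# This pattern is an assumption based on the two formats shown in the table.
PAGE_RANGE = re.compile(r"(?:\d+:)?(\d+)(?:-(?:\d+:)?(\d+))?")


def paper_length(pages: str) -> Optional[int]:
    """Return the number of pages, or None if the format is not recognized."""
    match = PAGE_RANGE.fullmatch(pages.strip())
    if match is None:
        return None  # e.g. empty field or roman numerals
    first = int(match.group(1))
    # A single page number without a range counts as one page.
    last = int(match.group(2)) if match.group(2) else first
    return last - first + 1 if last >= first else None


if __name__ == "__main__":
    for pages in ["1-11", "106-116", "18:1-18:33"]:
        print(pages, paper_length(pages))
```

Returning `None` for unrecognized formats keeps such rows visible for the manual check instead of silently assigning them a wrong length.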

After manually removing non-full papers from the list, we have a sampling frame that we can use to draw a stratified sample, for example by randomly selecting five papers for each venue and year. We tested our tool with ICSE, FSE, TSE, and TOSEM for the time frame 2014 to 2018.
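Drawing such a stratified sample from the cleaned CSV file could look roughly like the following Python sketch. It assumes the column names `venue` and `year` from the table above. The file name `papers.csv`, the delimiter, the fixed seed, and the function names are hypothetical choices for this example and not part of our tool.

```python
import csv
import random
from collections import defaultdict


def load_frame(path, delimiter=","):
    """Read the sampling frame; the delimiter is an assumption, adjust as needed."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def stratified_sample(rows, per_stratum=5, seed=42):
    """Draw up to `per_stratum` random rows for each (venue, year) pair."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[(row["venue"], row["year"])].append(row)
    sample = []
    for key in sorted(strata):
        papers = strata[key]
        # Strata with fewer papers than requested are taken completely.
        sample.extend(rng.sample(papers, min(per_stratum, len(papers))))
    return sample


if __name__ == "__main__":
    frame = load_frame("papers.csv")  # hypothetical file name
    for paper in stratified_sample(frame):
        print(paper["venue"], paper["year"], paper["title"])
```

Using a dedicated `random.Random` instance with a documented seed means that the same sampling frame always yields the same sample, which makes the selection easier to report and to replicate.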

More Information

The dblp-retriever tool is available on GitHub.


Sebastian Baltes
