Datasets

We are happy to share our data with you! This page provides access to two kinds of datasets: preprocessed static repository data and the interaction data we capture with the FeedBaG++ tool. We provide tool support and bindings for the datasets in both C# and Java to support a wide range of applications.

All datasets contain multiple files that are bundled in a single download archive for convenience. Please unpack this archive before working with it. You will find detailed examples of how to make use of the data in the project repository.
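If you prefer to unpack the archive programmatically, the following is a minimal sketch using only the Java standard library. It assumes the download is a regular zip file, which this page does not state, and the class name ArchiveUnpacker and its method are made up for illustration; they are not part of the provided bindings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the KaVE bindings.
class ArchiveUnpacker {

    // Extracts every entry of the zip file at `archive` into `targetDir`.
    // Assumption: the download archive is a plain zip file.
    static void unpack(Path archive, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (InputStream in = Files.newInputStream(archive);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Reject entries that would be written outside the target directory.
                if (!out.startsWith(root)) {
                    throw new IOException("Unsafe entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies the bytes of the current entry only.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

A call such as ArchiveUnpacker.unpack(Path.of("dataset.zip"), Path.of("dataset")) would then place the contained files under the dataset folder; both paths are placeholders.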

Static Repository Data

This dataset contains the contents of several preprocessed repositories. The original source code has been transformed into an intermediate representation called "Simplified Syntax Trees" (SSTs).

By downloading these files, you agree to use them only for scientific purposes and to respect the licenses of the original files that we have transformed into SSTs.

Reran the transformation on the previous checkout. The new version significantly increases the number of resolved types and includes several improvements for the transformation.

Reran the transformation on the previous checkout after several extensions and bug fixes to the analysis logic.

Original publication at MSR. (Checked out on Feb 14, 2016)

Please note: You should always use the most recent version of the dataset, because the current toolchain might not be able to read and process older archives. If you still decide to use an older version, make sure to process it with the KaVE sources that were available on the master branch when that dataset was created.

Interaction Data

This dataset contains all events that FeedBaG++ users have shared with us. We preprocess the data before publication: we merge uploads by the same users, sort events by timestamp, and remove some obvious noise from the data.
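To illustrate the first two preprocessing steps (merging uploads by user and sorting by timestamp), here is a minimal sketch. It is not the actual preprocessing code: the Event record, its fields, and the class name UploadMerger are hypothetical stand-ins for the much richer event types in the dataset, and the noise removal step is left out.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the real preprocessing pipeline.
class UploadMerger {

    // Assumed minimal event shape: who produced it, when, and some payload.
    record Event(String userId, Instant triggeredAt, String payload) {}

    // Collects the events of all uploads per user, then sorts each
    // user's events chronologically.
    static Map<String, List<Event>> mergeAndSort(List<List<Event>> uploads) {
        Map<String, List<Event>> byUser = new TreeMap<>();
        for (List<Event> upload : uploads) {
            for (Event e : upload) {
                byUser.computeIfAbsent(e.userId(), k -> new ArrayList<>()).add(e);
            }
        }
        for (List<Event> events : byUser.values()) {
            events.sort(Comparator.comparing(Event::triggeredAt));
        }
        return byUser;
    }
}
```

The sort is stable, so events with identical timestamps keep the order in which they were read from the uploads.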

By downloading these files, you agree to use them only for scientific purposes. You are not allowed to use the data to identify individuals (e.g., by combining it with other data sources) or in any other work that could be used to harm the participants.

Please note: The dataset contains events from different versions of FeedBaG++. Older versions might contain bugs or lack functionality that was added later (see release notes). You can use the versioning information stored in the events to filter the dataset to a version that fits your needs. The files are cumulative, meaning that each file contains all events collected up to that time, so it is sufficient to download only the newest file.
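Filtering by version could look like the following minimal sketch. The Event record, its toolVersion field, and the version strings are assumptions made for illustration; the real events store their versioning information in their own format, so please consult the bindings for the actual field names.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of the KaVE bindings.
class VersionFilter {

    // Assumed minimal event shape with a plain version string.
    record Event(String toolVersion, String payload) {}

    // Keeps only events whose version is in the accepted set.
    // Events without version information are dropped.
    static List<Event> keepVersions(List<Event> events, Set<String> accepted) {
        return events.stream()
                .filter(e -> e.toolVersion() != null && accepted.contains(e.toolVersion()))
                .collect(Collectors.toList());
    }
}
```

Passing an explicit set of accepted versions avoids any assumption about how version strings are ordered.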