Groundhog

A framework for crawling GitHub projects and raw data and extracting metrics from them

Groundhog is an easy-to-use framework for crawling raw GitHub data and extracting metrics from it. It leverages the Java language and the GitHub platform to help researchers better understand software repositories. Groundhog's goals are flexibility, extensibility, and simplicity.

WARNING: Groundhog is currently alpha software and its API will undergo major changes. The current version is an experiment that showcases the use of the GitHub API and the JavaCompiler.

On the developer side, Groundhog focuses on simplicity by shipping with a bare stack, allowing a researcher to get started quickly while making it easy to extend the application as and when they see fit.

It currently supports:

  • Partial integration with the GitHub API
  • Extracting and downloading projects from three forges
  • Collecting metrics about Java code
  • Plug-and-play features

Before reaching beta, we want to add the following to Groundhog:

  • Full integration with the GitHub API

Build

Groundhog uses Java 7 features, so you must have Java 7 installed before building. Groundhog also uses Maven, so you will need to download and install it to build the project.

Building for Eclipse

To make Groundhog behave like an Eclipse project, run the following command on the command line:

$ mvn eclipse:eclipse

If you prefer, you can just use the groundhog.jar file from the command line. You can generate this file in two ways:

Generating the JAR via Eclipse

Eclipse users can go to File > Export > Runnable JAR File and enter the CmdMain class for the option "Launch Configuration".

Generating the JAR via command line

Maven users can simply type in the root directory:

$ mvn package

and the JAR will be created in the target/ directory.

Running tests

$ mvn test

Usage

You can use Groundhog in two ways: as an executable JAR from the command line or as a library in your own Java project.

Using as an executable JAR

Search GitHub for projects matching "phonegap-facebook-plugin" and place the results (if any) in a folder called metrics:

$ java -jar groundhog.jar -forge github -out metrics phonegap-facebook-plugin

Using as a third-party library

Fetching Metadata

Metadata is fetched from GitHub's API. Unauthenticated requests are subject to a low rate limit, so to fetch more objects you need to obtain a GitHub API token and use it in Groundhog.
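How the token is handed to Groundhog is not shown in this document. The sketch below only illustrates the underlying mechanism: an authenticated request to GitHub's REST API carries the token in an Authorization header. The class name TokenHeaderSketch and its methods are hypothetical and are not part of Groundhog.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Groundhog: shows how a GitHub API token
// is attached to a plain HTTP request.
final class TokenHeaderSketch {

    /** Builds the Authorization header value for a personal access token. */
    static String authorizationValue(String token) {
        if (token == null || token.trim().isEmpty()) {
            throw new IllegalArgumentException("token must not be empty");
        }
        return "token " + token.trim();
    }

    /** Prepares (but does not send) an authenticated GET request. */
    static HttpURLConnection open(String url, String token) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Authorization", authorizationValue(token));
        conn.setRequestProperty("Accept", "application/vnd.github+json");
        return conn;
    }
}
```

Calling open("https://api.github.com/rate_limit", token) and reading the response is one way to check how many requests the token still allows.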

Project

You can use Groundhog to fetch metadata on a list of projects that meet a given criterion:

// Create a GitHub search object
Injector injector = Guice.createInjector(new SearchModule());
SearchGitHub searchGitHub = injector.getInstance(SearchGitHub.class);

// Search for projects named "opencv", starting at page 1 and stopping at the 3rd project
List<Project> projects = searchGitHub.getProjects("opencv", 1, 3);

Alternatively, you can search for projects without setting the limiting point. In this case Groundhog will fetch projects until your API limit is exceeded.

List<Project> projects = searchGitHub.getProjects("eclipse", 1, SearchGitHub.INFINITY);
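How Groundhog pages through results internally is not shown here. As a rough sketch of the idea behind a start page plus a limit, the loop below requests consecutive pages from a hypothetical PageSource until it has collected the requested number of items or a page comes back empty. PagedFetchSketch, PageSource, and NO_LIMIT are illustrative names, not Groundhog API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: generic "fetch pages until a limit is reached" loop.
final class PagedFetchSketch {

    /** Hypothetical source of result pages; page numbers start at 1. */
    interface PageSource<T> {
        List<T> fetchPage(int page);
    }

    /** Stand-in for "no limit", in the spirit of SearchGitHub.INFINITY. */
    static final int NO_LIMIT = Integer.MAX_VALUE;

    static <T> List<T> fetch(PageSource<T> source, int startPage, int limit) {
        List<T> results = new ArrayList<T>();
        int page = startPage;
        while (results.size() < limit) {
            List<T> items = source.fetchPage(page++);
            if (items.isEmpty()) {
                break; // no more results available
            }
            for (T item : items) {
                if (results.size() == limit) {
                    break; // limit reached in the middle of a page
                }
                results.add(item);
            }
        }
        return results;
    }
}
```

With NO_LIMIT the loop ends only when the source returns an empty page, which is why an unbounded search keeps going until nothing more can be fetched.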

Issues

Issues are objects that only make sense from a Project perspective.

To fetch the Issues of a given project using Groundhog you should first create the Project and then tell Groundhog to hit the API and get the data.

User user = new User("joyent");         // Create the User object
Project pr = new Project(user, "node"); // Create the Project object

// Tell Groundhog to fetch all Issues of that project and assign them to the Project object:
List<Issue> issues = searchGitHub.getAllProjectIssues(pr);

System.out.println("Listing 'em Issues...");
for (Issue issue: issues) {
  System.out.println(issue.getTitle());
}

Commits

You can easily fetch all commits of a project:

User user = new User("gustavopinto");
Project project = new Project(user, "groundhog-case-study");

List<Commit> commits = searchGitHub.getAllProjectCommits(project);

for (Commit com: commits) {
    System.out.println(com);
}

Milestones

Just like Issues, Groundhog lets you fetch the list of Milestones of a project, too.

List<Milestone> milestones = searchGitHub.getAllProjectMilestones(pr);

Languages

Software projects are often written in more than one programming language. Groundhog lets you fetch the list of languages used in a project along with the LoC (lines of code) count of each.

// Returns a List of Language objects for each language of project "pr"
List<Language> languages = searchGitHub.fetchProjectLanguages(pr);
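The accessors of the Language class are not shown in this document, so the sketch below works on a plain map from language name to size count, a hypothetical stand-in for the data carried by the returned list. It computes each language's percentage share of the project, a typical next step with this data. LanguageShareSketch is an illustrative name, not Groundhog API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: turns per-language counts into percentage shares.
final class LanguageShareSketch {

    /** Returns the share (0 to 100) of each language, keeping input order. */
    static Map<String, Double> shares(Map<String, Long> counts) {
        long total = 0;
        for (long count : counts.values()) {
            total += count;
        }
        Map<String, Double> result = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            double pct = total == 0 ? 0.0 : 100.0 * entry.getValue() / total;
            result.put(entry.getKey(), pct);
        }
        return result;
    }
}
```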

Contributors

You can also get the list of people who contributed to a project on GitHub:

User user = new User("rails");
Project project = new Project(user, "rails"); // project github.com/rails/rails

List<User> contributors = searchGitHub.getAllProjectContributors(project);

Local Data Extraction

In addition to the metadata extraction available via the GitHub API, Groundhog supports local data extraction from repositories via a Git interface.

You can, for example, count the number of commits in a project that include a Java file, via a GitCommitExtractor object:

GitCommitExtractor extractor = new GitCommitExtractor();
File project = new File("/tmp/elasticsearch");

extractor.numberOfCommitsWithExtension(project, "java");
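How GitCommitExtractor computes this number is not shown here. One way to obtain a comparable count, sketched below as an assumption rather than as Groundhog's implementation, is to parse the text produced by git log --name-only --pretty=format:"commit %H" and count the commits that list at least one path with the given extension. CommitExtensionCountSketch and COMMIT_MARKER are hypothetical names.

```java
// Illustrative only: counts commits touching files with a given extension,
// based on the textual output of "git log --name-only".
final class CommitExtensionCountSketch {

    /** Prefix of each commit header line in the expected input. */
    static final String COMMIT_MARKER = "commit ";

    /**
     * Expects text shaped like the output of:
     *   git log --name-only --pretty=format:"commit %H"
     * that is, one "commit <hash>" line followed by the changed paths.
     */
    static int countCommitsWithExtension(String gitLogOutput, String extension) {
        String suffix = "." + extension;
        int count = 0;
        boolean inCommit = false;
        boolean matched = false;
        for (String line : gitLogOutput.split("\\r?\\n")) {
            if (line.startsWith(COMMIT_MARKER)) {
                // A new commit begins: reset the per-commit flag.
                inCommit = true;
                matched = false;
            } else if (inCommit && !matched && line.trim().endsWith(suffix)) {
                // First matching path in this commit: count the commit once.
                matched = true;
                count++;
            }
        }
        return count;
    }
}
```

A path that itself begins with the marker text would be misread as a commit header, so a production version would want a less ambiguous separator.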

Documentation

Groundhog features a Wiki, where you can browse for more information.

You can generate the Javadoc files with the following command:

$ cd src/
$ javadoc -d src/src/groundhog br.cin.ufpe.groundhog

Core team

Contributions

Want to contribute code, documentation, or a bug report? That's great! Check out the Issues page.

Preferably, your contribution should be backed by JUnit tests.

Stability and maintainability are important goals of the Groundhog project. Well-written, tested code and proper documentation are very welcome.

License

Groundhog is released under GPL 2.