Today, we announced the launch of DataScience Trends, an interactive tool that makes it easy for anyone to explore and visualize data spanning 2.8 million open source repositories that GitHub and Google made publicly available last year.
The GitHub archive is a rich, three-terabyte dataset that includes more than 20 event types, from pull requests to file commits. Hidden in that data is important information about the open source tools that are changing the way data scientists work, which DataScience Trends lets users mine without writing code. You can read more about why we built the tool in our press release.
Why are trends in open source software important?
Open source software is essential to what we do here at DataScience. That’s because the teams we serve prefer working with the tools they love; in fact, 62% of them would rather code in the open source languages Python or R than in the proprietary solution SAS.
For that reason, the DataScience Cloud makes it possible for data science teams to work together using open source tools like Apache Spark, Jupyter notebooks, RStudio, and more. And thanks to DataScience Trends, we know which tools are the most important to our end users based on active development, popularity, and other metrics. That information helps us provide the integrations in our platform that data scientists want, and can help you create an effective open source technology stack of your own.
Want to know what tools you should consider for machine learning tasks and data visualization? Download our new white paper, “DataScience Trends Report: Open Source Tools for Enterprise Data Science.”
How does DataScience Trends work?
DataScience Trends is a user-friendly tool that makes it easy to select both the data you want to compare and the way you want to compare it. You can choose time-series data from up to five different repos (such as TensorFlow) or aggregations of repos that contain a particular keyword (e.g., “Apache”). You can compare these series in two different views: absolute and normalized. The absolute view does not apply any transformation to the raw data, while the normalized view divides each time point by the mean value of its own trend. Putting every series on a comparable scale makes it easier to spot correlations and to tell whether a change is unique to one trend or shared across the group.
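To make the normalized view concrete, here is a minimal Python sketch of dividing each point by the mean of its own series. The function name and the sample numbers are made up for illustration, and this is not the tool's actual code.

```python
from statistics import mean


def normalize(series):
    """Divide every point in a series by that series' mean value."""
    avg = mean(series)
    if avg == 0:
        # Assumption: an all-zero series stays at zero rather than failing.
        return [0.0 for _ in series]
    return [value / avg for value in series]


# Hypothetical daily counts for a small repo and a much larger one.
small_repo = [2, 4, 6]
large_repo = [200, 400, 600]

# Both series grow at the same relative pace, so after normalization
# they are identical even though the raw counts differ by 100x.
print(normalize(small_repo))
print(normalize(large_repo))
```

In the absolute view the small repo would look flat next to the large one. After normalization the two lines overlap, which is what makes shared patterns visible.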
DataScience Trends also allows you to remove noise from your plots. Change the moving average to plot each point as an average of all values over a certain time period. Doing so will “smooth” the data; for example, use a moving average of seven days to view weekly trends instead of daily ones. You can adjust the moving average from one day up to six weeks.
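The smoothing step can be sketched in a few lines of Python. This version assumes a trailing window (each point averages itself and the days before it) and uses a shorter window at the start of the series. The tool's exact windowing is not documented here, so treat both choices as assumptions.

```python
def moving_average(values, window):
    """Trailing moving average: each point becomes the mean of the
    last `window` values up to and including itself."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        # Assumption: early points use however many values exist so far.
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical daily counts with one spike per week.
daily = [7, 0, 0, 0, 0, 0, 0] * 2

# A seven-day window spreads each weekly spike evenly across the week.
weekly_view = moving_average(daily, 7)
print(weekly_view[6:])
```

A window of one leaves the data unchanged, while larger windows trade day-to-day detail for a clearer long-term shape.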
At present, DataScience Trends allows you to observe three key metrics associated with GitHub repositories:
- Stars: Stars represent the daily number of new people interested in receiving updates on a particular repo. Use stars to compare public interest (but not necessarily usage) across repos or series.
- Commits: Commits reflect how many new changes were made to a particular repo. Use commits to compare the pace of development across a set of repos.
- Pull Requests: Pull requests let contributors tell others about changes they've pushed to a repo on GitHub. Use pull requests as a proxy for the size of the community of contributors to selected repos.
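For readers curious how these metrics could be derived from raw event data, here is a rough Python sketch. In the public GitHub event data, stars are recorded as `WatchEvent`, pushes as `PushEvent`, and pull request activity as `PullRequestEvent`. The simplified record layout (`repo`, `date`, `type`, `payload`) and the decision to count only opened pull requests are assumptions made for illustration, not a description of how DataScience Trends computes its numbers.

```python
from collections import defaultdict


def daily_metrics(events):
    """Tally stars, commits, and pull requests per (repo, date)."""
    counts = defaultdict(lambda: {"stars": 0, "commits": 0, "pull_requests": 0})
    for event in events:
        key = (event["repo"], event["date"])
        kind = event["type"]
        payload = event.get("payload", {})
        if kind == "WatchEvent":
            # A WatchEvent is emitted when someone stars a repo.
            counts[key]["stars"] += 1
        elif kind == "PushEvent":
            # One push can carry several commits.
            counts[key]["commits"] += len(payload.get("commits", []))
        elif kind == "PullRequestEvent" and payload.get("action") == "opened":
            # Assumption: only newly opened pull requests are counted.
            counts[key]["pull_requests"] += 1
    return dict(counts)


# Hypothetical, simplified event records.
sample_events = [
    {"repo": "example/repo", "date": "2017-01-01", "type": "WatchEvent"},
    {"repo": "example/repo", "date": "2017-01-01", "type": "WatchEvent"},
    {"repo": "example/repo", "date": "2017-01-01", "type": "PushEvent",
     "payload": {"commits": [{"sha": "a"}, {"sha": "b"}, {"sha": "c"}]}},
    {"repo": "example/repo", "date": "2017-01-01", "type": "PullRequestEvent",
     "payload": {"action": "opened"}},
    {"repo": "example/repo", "date": "2017-01-01", "type": "PullRequestEvent",
     "payload": {"action": "closed"}},
]

print(daily_metrics(sample_events))
```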
And that’s all there is to it! Once you’ve created a plot of machine learning libraries, data visualization packages, or any of the thousands of other repository types available on GitHub, you can share it instantly on Twitter, Facebook, or LinkedIn.
Try DataScience Trends for yourself
It’s free to explore more than 10,000 of GitHub’s most popular repositories right in your browser. Sign up for access to start uncovering insights in open source development.