Overview
- GitHub published a repository‑level dataset under the permissive CC0‑1.0 license. It records language signals for over 40 million public repositories in more than 80 million classification rows.
- For each repository, the dataset supplies language classifications for the README, the most‑commented issue, and the most‑commented pull request, using the first 150 characters of each text as the sample and excluding texts under 20 characters.
- Language labels come from three separate classifiers (fastText, gcld3, and lingua‑py). Each entry includes a confidence score, and only classifications with a score above 0.5 are retained, so users can choose between strict agreement across classifiers and broader recall.
- The release highlights that language use varies by text type: Portuguese appears in more than 3 million non‑English READMEs, while Korean is the most common non‑English language in issue text. GitHub stresses that the dataset is metadata‑only, not a ground‑truth benchmark.
- GitHub pairs the public release with a June 16 presentation in Strasbourg to engage policymakers and researchers, and it warns the data should not be used to infer sensitive attributes about repository owners or contributors.
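
The choice between strict agreement and broader recall can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Row` layout, the field names, the classifier identifiers, and the sample rows are all hypothetical, and the real dataset's schema may differ. Only the 0.5 confidence threshold and the three classifier names come from the summary above.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical row layout; the published column names may differ."""
    repo: str
    text_type: str    # e.g. "readme", "issue", "pull_request"
    classifier: str   # "fasttext", "gcld3", or "lingua-py"
    language: str
    confidence: float


CLASSIFIERS = {"fasttext", "gcld3", "lingua-py"}
MIN_CONFIDENCE = 0.5  # the release keeps only classifications above this


def group_labels(rows):
    """Map (repo, text_type) -> {classifier: language} for retained rows."""
    grouped = defaultdict(dict)
    for r in rows:
        if r.confidence > MIN_CONFIDENCE:
            grouped[(r.repo, r.text_type)][r.classifier] = r.language
    return grouped


def strict_label(labels: dict) -> Optional[str]:
    """Return a language only if all three classifiers report the same one."""
    if set(labels) == CLASSIFIERS and len(set(labels.values())) == 1:
        return next(iter(labels.values()))
    return None


def broad_labels(labels: dict) -> set:
    """Return every language reported by at least one classifier."""
    return set(labels.values())


# Made-up sample rows, for illustration only.
rows = [
    Row("octo/app", "readme", "fasttext", "pt", 0.97),
    Row("octo/app", "readme", "gcld3", "pt", 0.91),
    Row("octo/app", "readme", "lingua-py", "pt", 0.88),
    Row("octo/app", "issue", "fasttext", "ko", 0.80),
    Row("octo/app", "issue", "gcld3", "ja", 0.62),
]
grouped = group_labels(rows)
readme = grouped[("octo/app", "readme")]
issue = grouped[("octo/app", "issue")]
```

Here `strict_label` trades coverage for precision by dropping any text where the classifiers disagree or one is missing, while `broad_labels` keeps every candidate language and leaves disambiguation to the caller.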