GitTables is a corpus of relational tables extracted from GitHub, currently comprising 1.7M tables. Our ongoing curation aims to grow the corpus to at least 10M tables. We annotated table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions. For questions, documentation and contact details are available on our website: https://gittables.github.io.
This dataset corresponds to the 1.7M tables used for the analysis in the GitTables paper: https://arxiv.org/abs/2106.07258. Characteristics of the table corpus (e.g., table sizes and topical distribution) are reported in that paper.
Responsible use
The current versions of GitTables, up to 0.0.4, contain tables extracted from CSV files in public GitHub repositories; hence, some tables might not be associated with a license that allows distribution. A new version of GitTables containing only licensed tables will be released soon, with the licenses attached to the file metadata. In the meantime, we suggest using GitHub's License API to retrieve the license of the repository each table comes from (the URL in the table metadata identifies that repository) to understand what restrictions apply to each table.
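As an illustration of the license lookup suggested above, the following is a minimal Python sketch, not an official GitTables tool. It assumes that the URL stored in a table's metadata has the form https://github.com/<owner>/<repo>/... or https://raw.githubusercontent.com/<owner>/<repo>/...; the exact metadata field name and URL layout should be checked against the GitTables documentation. The function names license_api_url and fetch_license_spdx_id are hypothetical helpers introduced only for this example. The endpoint queried is GitHub's REST endpoint GET /repos/{owner}/{repo}/license.

```python
import json
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import urlparse


def license_api_url(table_url: str) -> str:
    """Build the GitHub License API endpoint for the repository behind table_url.

    Assumption: the first two path segments of the URL are <owner>/<repo>,
    as in github.com and raw.githubusercontent.com URLs.
    """
    parts = [p for p in urlparse(table_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner/repo in {table_url!r}")
    owner, repo = parts[0], parts[1]
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def fetch_license_spdx_id(table_url: str) -> Optional[str]:
    """Return the SPDX identifier GitHub reports for the repository.

    Returns None when GitHub answers 404 (no license detected or
    repository not found). Performs a network request.
    """
    request = urllib.request.Request(
        license_api_url(table_url),
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
    return payload["license"]["spdx_id"]
```

For example, license_api_url("https://github.com/some-owner/some-repo/blob/main/data.csv"), with a made-up repository, yields https://api.github.com/repos/some-owner/some-repo/license. Unauthenticated requests to the GitHub API are rate limited, so checking many tables will likely require an access token and some caching per repository.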
Please be aware that this dataset is currently uncurated; hence, the underlying data files might contain sensitive, harmful, or otherwise undesired data. Similarly, biases towards certain subpopulations might be observed. The next release will be curated to mitigate this. Please avoid spreading or exactly replicating undesired content. If you observe any of these issues, please notify us through the contact form at https://gittables.github.io so that we can mitigate them.