About: GitTables 1.7M

An Entity of Type: schema:Dataset, within Data Space: demo.openlinksw.com, associated with source document(s)

Attributes    Values
type
name
  • GitTables 1.7M
author
datePublished
description
  • GitTables is a corpus of relational tables extracted from GitHub, currently containing 1.7M tables. Our continuing curation aims to grow the corpus to at least 10M tables. We annotated table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions. If you have questions, documentation and contact details are provided on our website: https://gittables.github.io.

    This dataset corresponds to the 1.7M tables used for the analysis in the GitTables paper: https://arxiv.org/abs/2106.07258. Characteristics of the table corpus (e.g. table sizes and topical distribution) are reported in that paper.

    Responsible use

    The current versions of GitTables, up to 0.0.4, contain tables extracted from CSV files in public GitHub repositories, so some tables might not be associated with a license that allows distribution. A new version of GitTables containing only licensed tables will be released soon, with the licenses attached to the file metadata. In the meantime, we suggest using GitHub's License API to retrieve the license associated with each table (the URL in the metadata can be used for this) to understand which restrictions apply.
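
    The license lookup suggested above could be sketched roughly as follows. This is a minimal illustration, not part of GitTables: it assumes the URL stored in a table's metadata points at a file on github.com, raw.githubusercontent.com or api.github.com, so that the first two path components name the owner and the repository. The helper names `license_api_url` and `fetch_license` are hypothetical. The endpoint used is GitHub's REST route `GET /repos/{owner}/{repo}/license`.

```python
import json
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import urlparse


def license_api_url(table_url: str) -> str:
    """Build the GitHub License API endpoint for the repository behind a file URL.

    Assumption: the URL path starts with {owner}/{repo}/..., optionally
    preceded by "repos" when the URL is already an api.github.com URL.
    """
    parsed = urlparse(table_url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc == "api.github.com" and parts[:1] == ["repos"]:
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner/repo in {table_url!r}")
    owner, repo = parts[0], parts[1]
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def fetch_license(table_url: str) -> Optional[str]:
    """Return the SPDX identifier GitHub reports for the repository, if any."""
    request = urllib.request.Request(
        license_api_url(table_url),
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 404:
            # Repository not found, or GitHub detected no license for it.
            return None
        raise
    return (payload.get("license") or {}).get("spdx_id")
```

    Calling `fetch_license(url)` with the URL from a table's metadata returns an SPDX identifier or `None`. Unauthenticated requests to the GitHub API are rate limited, so looking up many tables would likely need an access token and caching per repository.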

    Please be aware that this dataset is currently uncurated, so the underlying data files might contain sensitive, harmful or otherwise undesired data. Similarly, biases towards certain subpopulations might be observed. The next release will be curated to mitigate this. The spread and exact replication of undesired content should be avoided. If you observe any of these issues, please notify us through the contact form at https://gittables.github.io so that we can mitigate them.

schema:distribution