This HTML5 document contains 103 embedded RDF statements represented using HTML+Microdata notation.

The embedded RDF content will be recognized by any processor of HTML5 Microdata.

Namespace Prefixes

Prefix    IRI
schema    http://schema.org/
n5        https://orcid.org/
n2        https://doi.org/10.5281/
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
xsdh      http://www.w3.org/2001/XMLSchema#
n4        https://zenodo.org/record/6517052#
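
The prefix table above lets each abbreviated name in the statements be expanded to a full IRI by concatenating the namespace IRI with the local part. The following is a minimal sketch of that expansion; the `PREFIXES` dictionary and the `expand` helper are illustrative names, not part of the source document.

```python
# Namespace prefixes as listed in the table above.
PREFIXES = {
    "schema": "http://schema.org/",
    "n5": "https://orcid.org/",
    "n2": "https://doi.org/10.5281/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsdh": "http://www.w3.org/2001/XMLSchema#",
    "n4": "https://zenodo.org/record/6517052#",
}


def expand(name: str) -> str:
    """Expand a prefixed name such as 'schema:Dataset' to a full IRI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return PREFIXES[prefix] + local


# Hypothetical usage with names taken from the statements in this document.
subject_iri = expand("n2:zenodo.6517052")
author_iri = expand("n5:0000-0002-0949-7290")
```

Here `subject_iri` is the DOI link of the dataset and `author_iri` is the ORCID link of one of its authors.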

Statements

Subject Item
n2:zenodo.6517052
rdf:type
schema:Dataset
schema:name
GitTables 1M
schema:author
n4:Person n4:Person_1 n5:0000-0002-0949-7290
schema:datePublished
2022-05-04
schema:description
<p><strong>Summary</strong></p> <p>GitTables 1M (<a href="https://gittables.github.io/">https://gittables.github.io</a>) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories that are associated with&nbsp;a license that&nbsp;allows&nbsp;distribution. We aim to grow this to at least 10M tables.</p> <p>Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations&nbsp;corresponding to &gt;2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.</p> <p>We believe GitTables can facilitate many use cases, including:</p> <ul> <li> <p>Data integration, search and validation.</p> </li> <li> <p>Data visualization and analysis recommendation.</p> </li> <li> <p>Schema analysis and completion for e.g. database or knowledge base design.</p> </li> </ul> <p>If you have questions, the paper, documentation, and contact details are provided on the website: <a href="https://gittables.github.io/">https://gittables.github.io</a>. We recommend using Zenodo&#39;s API to easily download the full dataset (i.e. all zipped topic subsets).</p> <p>&nbsp;</p> <p><strong>Dataset contents</strong></p> <p>The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. 
This version corresponds to the following version of the paper: <a href="https://arxiv.org/abs/2106.07258v4">https://arxiv.org/abs/2106.07258v4</a>.</p> <p>In summary, this dataset can be characterized as follows:</p> <table> <tbody> <tr> <td> <p><strong>Statistic</strong></p> </td> <td> <p><strong>Value</strong></p> </td> </tr> <tr> <td> <p># tables</p> </td> <td> <p>1M</p> </td> </tr> <tr> <td> <p>average # columns</p> </td> <td> <p>12</p> </td> </tr> <tr> <td> <p>average # rows</p> </td> <td> <p>142</p> </td> </tr> <tr> <td> <p># annotated tables (at least 1 column annotation)</p> </td> <td> <p>723K+ (DBpedia), 738K+ (Schema.org)</p> </td> </tr> <tr> <td> <p># unique semantic types</p> </td> <td> <p>835 (DBpedia), 677 (Schema.org)</p> </td> </tr> </tbody> </table> <p>&nbsp;</p> <p><strong>Future releases</strong></p> <p>Future releases will include the following:</p> <ul> <li> <p>Increased number of tables (expected at least 10M)</p> </li> </ul> <p>&nbsp;</p> <p><strong>Associated&nbsp;datasets</strong></p> <p>- GitTables benchmark -&nbsp;column type detection:&nbsp;<a href="https://zenodo.org/record/5706316">https://zenodo.org/record/5706316</a></p> <p>- GitTables 1M -&nbsp;CSV files:&nbsp;<a href="https://zenodo.org/record/6515973">https://zenodo.org/record/6515973</a></p> <p>&nbsp;</p>
schema:distribution
n4:DataDownload_57 n4:DataDownload_60 n4:DataDownload_59 n4:DataDownload_30 n4:DataDownload_29 n4:DataDownload_32 n4:DataDownload_31 n4:DataDownload_34 n4:DataDownload_94 n4:DataDownload_33 n4:DataDownload_93 n4:DataDownload_36 n4:DataDownload_35 n4:DataDownload_95 n4:DataDownload_38 n4:DataDownload_37 n4:DataDownload_40 n4:DataDownload_39 n4:DataDownload_42 n4:DataDownload_41 n4:DataDownload_44 n4:DataDownload_43 n4:DataDownload_14 n4:DataDownload_13 n4:DataDownload_16 n4:DataDownload_15 n4:DataDownload_18 n4:DataDownload_78 n4:DataDownload_17 n4:DataDownload_77 n4:DataDownload_20 n4:DataDownload_80 n4:DataDownload_19 n4:DataDownload_79 n4:DataDownload_22 n4:DataDownload_82 n4:DataDownload_21 n4:DataDownload_81 n4:DataDownload_24 n4:DataDownload_84 n4:DataDownload_23 n4:DataDownload_83 n4:DataDownload_26 n4:DataDownload_86 n4:DataDownload_25 n4:DataDownload_85 n4:DataDownload_28 n4:DataDownload_88 n4:DataDownload_27 n4:DataDownload_87 n4:DataDownload n4:DataDownload_90 n4:DataDownload_89 n4:DataDownload_92 n4:DataDownload_91 n4:DataDownload_2 n4:DataDownload_62 n4:DataDownload_1 n4:DataDownload_61 n4:DataDownload_4 n4:DataDownload_64 n4:DataDownload_3 n4:DataDownload_63 n4:DataDownload_6 n4:DataDownload_66 n4:DataDownload_5 n4:DataDownload_65 n4:DataDownload_8 n4:DataDownload_68 n4:DataDownload_7 n4:DataDownload_67 n4:DataDownload_10 n4:DataDownload_70 n4:DataDownload_9 n4:DataDownload_69 n4:DataDownload_12 n4:DataDownload_72 n4:DataDownload_11 n4:DataDownload_71 n4:DataDownload_46 n4:DataDownload_74 n4:DataDownload_45 n4:DataDownload_73 n4:DataDownload_48 n4:DataDownload_76 n4:DataDownload_47 n4:DataDownload_75 n4:DataDownload_50 n4:DataDownload_49 n4:DataDownload_52 n4:DataDownload_51 n4:DataDownload_54 n4:DataDownload_53 n4:DataDownload_56 n4:DataDownload_55 n4:DataDownload_58
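
The schema:distribution values above stand for the record's downloadable files, and the description recommends using Zenodo's API to fetch them. The sketch below shows one way this might look using only the Python standard library. It assumes the record is served as JSON at `https://zenodo.org/api/records/6517052` and that each entry in the response's `files` list carries a `key` (file name) and a `links.self` download URL; both are assumptions about the API's response shape and should be checked against Zenodo's current documentation. The function names are illustrative.

```python
import json
import urllib.request

# Record ID taken from the n4 namespace IRI in this document.
RECORD_ID = "6517052"
# Assumed endpoint shape for Zenodo's REST API.
API_URL = f"https://zenodo.org/api/records/{RECORD_ID}"


def fetch_record(url: str = API_URL) -> dict:
    """Download the record metadata as a dictionary (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def file_links(record: dict) -> list:
    """Return (file name, download URL) pairs from a record dictionary.

    Assumes each file entry has 'key' and 'links' -> 'self' fields.
    """
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]


def download(url: str, target_path: str) -> None:
    """Save one file to disk (requires network access)."""
    urllib.request.urlretrieve(url, target_path)


# Hypothetical usage, not executed here because it needs network access:
#   for name, link in file_links(fetch_record()):
#       download(link, name)
```

Keeping the parsing step (`file_links`) separate from the network calls makes it possible to check the parsing logic against a saved response before starting a large download.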