the_pile_books3

References:

plain_text

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:the_pile_books3/plain_text')
  • Description:
This dataset is Shawn Presser's work and is part of the EleutherAI/The Pile dataset. It contains all of bibliotik in plain .txt form, i.e. roughly 197,000 books processed in exactly the same way as was done for bookcorpusopen (a.k.a. books1). It seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately, OpenAI will not give details, so we know very little about any differences. People suspect it's "all of libgen", but that is purely conjecture.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 196639
  • Features:
{
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
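
The following is a minimal sketch of how records with the two string features above might be read after loading. The helper names `describe_example` and `load_books3` are hypothetical and not part of TFDS; the sketch also assumes the dataset is still downloadable and that string features arrive as bytes, which is the usual behavior when iterating with `tfds.as_numpy`.

```python
def describe_example(example):
    """Return (title, character count) for one record with 'title' and 'text' fields.

    Hypothetical helper for illustration; not part of TFDS.
    """
    title = example["title"]
    text = example["text"]
    # String features are typically yielded as bytes; decode them if so.
    if isinstance(title, bytes):
        title = title.decode("utf-8")
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return title, len(text)


def load_books3(split="train"):
    """Load the split with the command shown on this page.

    Hypothetical wrapper; calling it triggers a very large download.
    """
    # Imported lazily so describe_example can be used without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load("huggingface:the_pile_books3/plain_text", split=split)
    # Convert tf.Tensor values into NumPy/bytes values for plain Python use.
    return tfds.as_numpy(ds)
```

To inspect the data, iterate over `load_books3()` and pass each record to `describe_example`. Each record holds an entire book in `text`, so printing the length rather than the content keeps the output manageable.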