Hugging Face Dataset Guide¶

Create a New Dataset Repository¶

When creating a new dataset repository, you can make the dataset Public (accessible to anyone on the internet) or Private (accessible only to you or, for a repository owned by an organization, to members of that organization).

New dataset repository interface
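
The same Public or Private choice can also be made from Python. The following is a minimal sketch, assuming the `huggingface_hub` library is installed and you are already logged in. `HfApi.create_repo` is the library call; the helper names `dataset_repo_settings` and `create_dataset_repo` and the repository name `my-org/my-dataset` are hypothetical and used only for illustration.

```python
def dataset_repo_settings(repo_id: str, private: bool = True) -> dict:
    """Collect the keyword arguments passed to HfApi.create_repo."""
    return {
        "repo_id": repo_id,
        "repo_type": "dataset",  # create a dataset repository rather than a model
        "private": private,      # False makes the repository public
    }


def create_dataset_repo(repo_id: str, private: bool = True):
    """Create the dataset repository on the Hub (needs network access and a login)."""
    # Imported here so the helper above can be used without the library installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # exist_ok=True avoids an error if the repository was already created.
    return api.create_repo(exist_ok=True, **dataset_repo_settings(repo_id, private))


# Example call with a hypothetical repository name (not executed here):
# create_dataset_repo("my-org/my-dataset", private=True)
```

Keeping the settings in a separate helper makes it easy to inspect what will be sent before anything is created on the Hub.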

Upload a Dataset with the Web Interface¶

In the Files and versions tab of the dataset repository, click Add file to upload files through the Hugging Face web interface.

Dataset repository Add file button

Upload a Dataset with HfApi¶

from huggingface_hub import HfApi, login

# Log in with your personal access token (find your tokens under Settings > Access Tokens)
login()

api = HfApi()

# Replace each <...> placeholder with your own value
api.upload_file(
    path_or_fileobj="<the local file path that you would like to upload>",
    path_in_repo="<the path in the repo>",
    repo_id="<ABC-Center/dataset name>",
    repo_type="dataset",
)
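
When several files need to go up, the single `upload_file` call can be repeated over a folder. The sketch below is one possible way to do that, not the only one. The helper names `plan_uploads` and `upload_planned_files` are hypothetical; only `HfApi.upload_file` and its four arguments come from the example above. It assumes you have already called `login()`.

```python
from pathlib import Path


def plan_uploads(local_dir: str, repo_prefix: str = "") -> list[tuple[str, str]]:
    """Pair every file under local_dir with the path it would get in the repo."""
    root = Path(local_dir)
    prefix = repo_prefix.strip("/")
    plan = []
    for path in root.rglob("*"):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            target = f"{prefix}/{relative}" if prefix else relative
            plan.append((str(path), target))
    # Sort by repo path so the upload order is predictable.
    return sorted(plan, key=lambda pair: pair[1])


def upload_planned_files(plan: list[tuple[str, str]], repo_id: str) -> None:
    """Upload each planned file with HfApi.upload_file (needs network access and a login)."""
    # Imported here so plan_uploads can be tried without the library installed.
    from huggingface_hub import HfApi

    api = HfApi()
    for local_path, path_in_repo in plan:
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="dataset",
        )
```

Printing the result of `plan_uploads` first gives a dry run of what would be uploaded. Each `upload_file` call creates its own commit; the library also provides `HfApi.upload_folder`, which uploads a whole folder in a single commit.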

Upload a Dataset with Git¶

If the Dataset is Less Than 5GB¶

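Clone the repository, navigate to the folder for the repository, and then add, commit, and push your files:

# Clone the repository
git clone https://huggingface.co/datasets/username/repo-name

# Navigate to the folder for the repository
cd repo-name

# Copy your dataset files into this folder, then add, commit, and push them
git add .
git commit -m '<commit message>'
git push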
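
To decide which of the two Git workflows applies, it can help to measure the dataset folder first. The following is a rough sketch under stated assumptions: "5GB" is read as 5 × 10^9 bytes, the `.git` directory is ignored, and the names `FIVE_GB`, `folder_size_bytes`, and `exceeds_limit` are hypothetical.

```python
from pathlib import Path

# Assumption: "5GB" is interpreted as 5 * 10**9 bytes.
FIVE_GB = 5 * 10**9


def folder_size_bytes(folder: str) -> int:
    """Total size of all regular files under folder, skipping the .git directory."""
    root = Path(folder)
    total = 0
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue  # repository metadata is not part of the dataset
        if path.is_file():
            total += path.stat().st_size
    return total


def exceeds_limit(folder: str, limit_bytes: int = FIVE_GB) -> bool:
    """Return True when the folder is larger than the given limit."""
    return folder_size_bytes(folder) > limit_bytes
```

For example, `exceeds_limit("repo-name")` returning `True` would suggest following the steps for larger datasets.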

If the Dataset is Larger Than 5GB¶

Install Git LFS¶

Follow the installation instructions at https://git-lfs.com/.

Install the Hugging Face CLI¶

# Use either of the following options.

# Option 1: Homebrew
brew install huggingface-cli

# Option 2: pip
pip install -U "huggingface_hub[cli]"

Enable the repository to upload large files¶

# The argument is the path to your local clone of the dataset repository
huggingface-cli lfs-enable-largefiles <your local dataset>

Initialize Git LFS¶

git lfs install

Track large files (e.g., .csv files)¶

# Adds a line to .gitattributes, which Git uses to determine files managed by LFS
git lfs track "*.csv"
git add .gitattributes
git commit -m "Track large files with Git LFS"
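
After `git lfs track "*.csv"`, `.gitattributes` typically contains a line such as `*.csv filter=lfs diff=lfs merge=lfs -text`. The sketch below reads such a file and reports which files would be handled by LFS. It is a simplified approximation that matches patterns against the file name only, whereas Git's own attribute matching has more rules. The helper names `lfs_patterns` and `is_tracked_by_lfs` are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def lfs_patterns(gitattributes_text: str) -> list[str]:
    """Return the patterns in a .gitattributes file that are assigned to Git LFS."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_tracked_by_lfs(file_path: str, gitattributes_text: str) -> bool:
    """Approximate check: match the file name against the LFS patterns."""
    name = PurePosixPath(file_path).name
    return any(fnmatch(name, pattern) for pattern in lfs_patterns(gitattributes_text))
```

Running this check before committing can catch a large file whose extension was never passed to `git lfs track`.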

Add, commit, and push the files¶

# Stage all new and changed files, then commit and push
git add .
git commit -m '<commit message>'
git push