Data Merging
Overview
This document walks through the code in a Jupyter Notebook that prepares a dataset and uploads it to the Hugging Face Hub.
Unzipping Data
!unzip /content/data_LLM.zip
Comment: Extracts ‘data_LLM.zip’ so that the raw JSON files are available for processing. The later steps expect them under /content/data_LLM.
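Where the `unzip` shell command is not available, the same step can be sketched with Python's standard `zipfile` module. The helper name `extract_archive` and its default target directory are assumptions made for illustration; the notebook itself uses the shell command above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return the member names."""
    target = Path(target_dir)
    # Create the target directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Hypothetical usage mirroring the notebook's paths:
# extract_archive("/content/data_LLM.zip", "/content")
```

Returning the member names makes it easy to confirm that the archive really contains the folder the later steps read from.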
Installing Datasets Package
!pip install datasets
Comment: Installs the Hugging Face ‘datasets’ package, which is used below to load the merged file and push it to the Hub.
Merging JSON Files into JSONL
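import os
import json
import glob
# Directory holding the extracted JSON files, and the name of the merged output file
directory = "/content/data_LLM"
output_jsonl_filename = "merged_dataset.jsonl"
# Collect every .json file in the directory
json_pattern = os.path.join(directory, '*.json')
file_list = glob.glob(json_pattern)
# Write the content of each JSON file as one line of the JSONL file
with open(output_jsonl_filename, 'w') as outfile:
    for file in file_list:
        with open(file, 'r') as f:
            json_obj = json.load(f)
        outfile.write(json.dumps(json_obj) + '\n')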
Comment: Reads every JSON file in the specified directory and writes each file’s content as one line of a single JSONL file, so the whole collection can be loaded as one dataset.
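The merge loop above writes whatever each file contains as one line, so a file whose top level is a JSON array would end up as a single array-valued line. The following sketch is one possible variant, not the notebook's code: the function name `merge_json_to_jsonl` and the decision to expand top-level arrays into one record per line are assumptions. It also sorts the file list so that the output order is reproducible.

```python
import glob
import json
import os


def merge_json_to_jsonl(directory, output_path):
    """Merge all *.json files in directory into one JSONL file; return the record count."""
    count = 0
    # Sort so the output order does not depend on the filesystem listing order.
    files = sorted(glob.glob(os.path.join(directory, "*.json")))
    with open(output_path, "w", encoding="utf-8") as outfile:
        for path in files:
            with open(path, "r", encoding="utf-8") as f:
                content = json.load(f)
            # Assumption: a top-level list holds several records, anything else is one record.
            records = content if isinstance(content, list) else [content]
            for record in records:
                outfile.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count


# Hypothetical usage with the notebook's paths:
# merge_json_to_jsonl("/content/data_LLM", "merged_dataset.jsonl")
```

The returned count gives a quick sanity check against the number of records expected from the source files.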
Loading Dataset
from datasets import load_dataset
jsonl_file_path = output_jsonl_filename
dataset = load_dataset('json', data_files=jsonl_file_path)
print(dataset['train'][0])
Comment: Loads the merged JSONL file with the ‘datasets’ library. Because no split is specified, the records are placed in the default ‘train’ split, and the first record is printed for verification.
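Records with differing fields can lead to schema errors or unexpected null values at load time. As a hedged pre-check, the sketch below uses only the standard library to report lines that fail to parse or whose keys differ from the first record. The function name `check_jsonl` and its return format are hypothetical and not part of the notebook.

```python
import json


def check_jsonl(path):
    """Return (parsed_count, problems); problems is a list of (line_number, message)."""
    problems = []
    expected_keys = None
    parsed_count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            parsed_count += 1
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            keys = set(record)
            if expected_keys is None:
                expected_keys = keys  # the first object defines the reference key set
            elif keys != expected_keys:
                problems.append((line_number, "keys differ from the first record"))
    return parsed_count, problems


# Hypothetical usage before calling load_dataset:
# count, problems = check_jsonl("merged_dataset.jsonl")
```

An empty `problems` list suggests the file is uniform enough to load; it does not check value types, which would need a separate pass.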
Authentication for Hugging Face
from huggingface_hub import notebook_login
notebook_login()
Comment: Prompts for a Hugging Face access token inside the notebook. The token needs write permission for the upload in the next step.
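Outside an interactive notebook, the prompt can be replaced by a token read from the environment and passed to `huggingface_hub.login`. This is a minimal sketch under assumptions: the helper names `read_hf_token` and `login_from_env` are hypothetical, and it assumes the token has been stored in the `HF_TOKEN` environment variable beforehand.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or raise if it is missing or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token


def login_from_env(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub with a token taken from the environment."""
    # Imported lazily so read_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))


# Hypothetical usage in a script:
# login_from_env()
```

Keeping the token in the environment avoids pasting it into the notebook, where it could be saved together with the cell output.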
Pushing to Hugging Face
dataset.push_to_hub("badreddine_LLM_data")
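Comment: Pushes the prepared dataset to the Hugging Face Hub under the specified repository name. Since the name carries no namespace prefix, the repository is created under the account of the logged-in user, where others can access it if the repository is public.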
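If the data should not be visible to everyone, `push_to_hub` accepts a `private` argument, and the repository name can be given with an explicit namespace. The wrapper below is only a sketch: the function name `push_dataset`, the `namespace` parameter, and the choice of `private=True` as default are assumptions, not the notebook's behavior.

```python
def push_dataset(dataset, namespace, name, private=True):
    """Push dataset to '<namespace>/<name>' on the Hub and return the repository id."""
    # Assumption: an explicit namespace (user or organization) is always supplied.
    repo_id = f"{namespace}/{name}"
    # Works with any object exposing push_to_hub(repo_id, private=...),
    # such as the object returned by load_dataset.
    dataset.push_to_hub(repo_id, private=private)
    return repo_id


# Hypothetical usage ("your-username" is a placeholder):
# push_dataset(dataset, "your-username", "badreddine_LLM_data")
```

Starting private and switching the repository to public later is the more cautious order, since a public upload may already have been seen or copied by the time it is withdrawn.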
Conclusion
This document walked through the notebook’s steps for preparing a dataset and uploading it to the Hugging Face Hub: extracting the archive, merging the JSON files into JSONL, loading the result, authenticating, and pushing.