AlphaBit - OpenML

Training Dataset

Training Dataset [3 Samples] [Larger dataset] (3671 Labeled Images) (343 Backgrounds)

Training Dataset [3 Samples] [Medium dataset] (2653 Labeled Images) (317 Backgrounds) [Recommended for your first time training]

Download

This dataset was created specifically for training machine learning models and is available as open source. The images are organized into well-defined categories (classes), which gives the set a clear structure that is easy to use during training.

Key Points:

Class Balance: To reduce the risk of unstable training, keep the classes within 5% of each other: no class should have more than 5% fewer labeled samples than the largest class. A larger imbalance can lead to suboptimal performance and poor generalization.

Verification and Validation: At the end of the documentation, a Python script is included that iterates through all label files in the set and displays the number of labeled samples for each class. This verification tool helps maintain dataset integrity and checks compliance with the balance requirement.
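As a quick illustration of the 5% rule, here is a minimal sketch applied to made-up per-class counts. The function name `is_balanced` and the counts are hypothetical; the sketch assumes that every class is compared against the largest class.

```python
def is_balanced(counts, tolerance=0.05):
    """Return True if no class is more than `tolerance` below the largest class."""
    largest = max(counts.values())
    return all(largest - n <= largest * tolerance for n in counts.values())


# Hypothetical counts, for illustration only
example = {"YellowSample": 1000, "BlueSample": 960, "RedSample": 940}
print(is_balanced(example))  # False: RedSample is 60 short, the limit is 50
```

With these numbers the limit is 5% of 1000, so 50 samples. BlueSample (40 short) is within the limit, but RedSample (60 short) is not, so the set counts as unbalanced.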

Data Distribution and the Importance of Validation in AI Training

When training an AI model, it is essential to split the dataset into two main parts:

Training Data: This represents 80% or 90% of the total data. The model "learns" from this data, identifying patterns and relationships relevant to the given task.

Validation Data: This represents the remaining 20% or 10% of the data. It is used to evaluate the model's performance on data it has not seen during training.
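The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `split_dataset`, the fixed seed, and the file names are hypothetical and not part of the dataset tooling.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of `items` and split it into (train, validation) lists."""
    shuffled = list(items)
    # A seeded generator makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names, for illustration only
images = [f"img_{i:04d}.jpg" for i in range(100)]
train, val = split_dataset(images, train_fraction=0.8)
print(len(train), len(val))  # 80 20
```

Shuffling before cutting matters: if the files are sorted by class or by capture session, an unshuffled split could leave a whole class out of the validation set.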

Why is Validation Important?

Avoiding Overfitting: Validation allows detection of situations where the model fits the training data too well but fails to generalize to new data.

Choosing Optimal Parameters: By evaluating performance on the validation set, you can adjust the model's hyperparameters to improve accuracy and robustness.

Objective Evaluation: The validation set offers a realistic estimate of the model's performance in real-world situations, on unknown data.
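One common way to use the validation set is to watch the validation loss per epoch. The sketch below uses made-up loss curves and a hypothetical helper `best_epoch`; it is not tied to any particular training framework.

```python
def best_epoch(val_losses):
    """Return the 0-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


# Hypothetical loss curves, for illustration only
train_losses = [0.90, 0.60, 0.40, 0.30, 0.20, 0.15]
val_losses = [0.95, 0.70, 0.55, 0.50, 0.58, 0.66]

epoch = best_epoch(val_losses)
print(epoch)  # 3
```

In this example the training loss keeps falling after epoch 3 while the validation loss rises again, which is the typical sign of overfitting. The same comparison can be repeated for each hyperparameter setting, keeping the one with the best validation result.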

Choosing Between 80/20 and 90/10:

80% Training / 20% Validation: Recommended when you have a sufficiently large dataset. A larger validation set helps you evaluate model performance more precisely.

90% Training / 10% Validation: Preferable when the dataset is smaller. The model gets more examples to learn from, but evaluation is done on a smaller validation set, which may give a slightly less robust picture of performance.
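To make the trade-off concrete, the following sketch computes the split sizes for the two datasets listed at the top of this page. It assumes that only the labeled images are split (background images are not counted), and the helper `split_sizes` is a hypothetical name.

```python
def split_sizes(total, train_percent):
    """Return (train, validation) sizes using integer arithmetic."""
    train = total * train_percent // 100
    return train, total - train


for name, total in [("Medium", 2653), ("Larger", 3671)]:
    for pct in (80, 90):
        train, val = split_sizes(total, pct)
        print(f"{name} ({total} images), {pct}/{100 - pct}: "
              f"{train} train, {val} validation")
```

For the medium dataset this gives 2122 training and 531 validation images at 80/20, versus 2387 and 266 at 90/10, so moving to 90/10 roughly halves the number of images available for evaluation.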

Python Script For Checking Class Balance

import os
import xml.etree.ElementTree as ET

from termcolor import colored  # third-party package: pip install termcolor

# Folder containing the Pascal VOC XML label files (stored next to the images)
voc_labels_dir = "datasets/AI/train/images"

# This script only reads the XML files and counts labels. It does not create
# or overwrite any YOLO label files.
counts = {"YellowSample": 0, "BlueSample": 0, "RedSample": 0}
total_labeled_images = 0

for file in sorted(os.listdir(voc_labels_dir)):
    if not file.endswith(".xml"):
        continue

    root = ET.parse(os.path.join(voc_labels_dir, file)).getroot()

    has_known_label = False
    for obj in root.findall("object"):
        class_name = obj.find("name").text
        if class_name in counts:
            counts[class_name] += 1
            has_known_label = True
        else:
            print(f"Warning: Unknown class '{class_name}' in {file}")

    # Count each image once, no matter how many samples it contains
    if has_known_label:
        total_labeled_images += 1

# Compare every class against the largest one (max() also handles ties)
maxl = max(counts.values())
differences = {name: maxl - count for name, count in counts.items()}

if any(diff > maxl * 0.05 for diff in differences.values()):
    print(colored("Warning: There is a difference of more than 5% between the classes.", "red"))
    print(colored("Please check the labeled images and make sure that the classes are balanced.", "red"))
else:
    print(colored("Classes are balanced.", "green"))

print("")
print("Total labeled images: " + colored(str(total_labeled_images), "green"))
for label, color, name in [
    ("Yellow", "yellow", "YellowSample"),
    ("Blue", "blue", "BlueSample"),
    ("Red", "red", "RedSample"),
]:
    print(
        colored(f"{label} samples: ", color)
        + colored(str(counts[name]), "green")
        + " | [Difference]: "
        + colored(str(differences[name]), "red")
    )

Support -> Discord
