The impact of label encoding and random seeds on model performance and reproducibility in Autokeras

Table of Contents

Label handling in Autokeras

Random seeds and model reproducibility

Strategies to ensure the reproducibility of Autokeras models

Recommended label encoding practice

Summary


Linda Hamilton

Oct 06, 2025 pm 05:33 PM

With Autokeras' StructuredDataClassifier, passing One-Hot encoded labels directly and passing the same labels converted to integers can produce markedly different results. The difference usually does not come from a flaw in how Autokeras handles labels; it is more often tied to the effect of random seeds on model training and hyperparameter search. To keep model performance stable and experimental results reproducible, set the random seeds correctly and understand how Autokeras processes labels internally.

Label handling in Autokeras

In classification tasks, label encoding is a key preprocessing step. The two common schemes are One-Hot encoding and integer encoding. Autokeras' StructuredDataClassifier is designed for classification and usually expects class labels as integers. If you provide One-Hot encoded labels instead, Autokeras still treats the task as classification and converts the labels as needed in its internal pipeline.

In fact, after receiving integer labels, Autokeras converts them to One-Hot encoding so that they are compatible with loss functions commonly used for multi-class tasks, such as CategoricalCrossentropy. You can check whether a OneHotEncoder object appears in the preprocessor chain by inspecting clf.outputs[0].in_blocks[0].get_hyper_preprocessors(), and confirm the loss function in use through clf.outputs[0].in_blocks[0].loss. Whether you provide One-Hot encoded labels or integer labels, the label representation and loss function used in the final training are therefore usually the same. When a large performance gap appears between the two (for example, accuracy of 0.40 versus 0.97), the cause is usually not the “correctness” of the label encoding but some other factor.
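
To illustrate why the two label representations lead to the same training signal, here is a minimal sketch in plain Python. It does not call Autokeras or Keras; the function names and the probability vector are made up for illustration. It shows that cross-entropy computed against a One-Hot target equals cross-entropy computed against the corresponding integer label.

```python
import math


def one_hot(label, num_classes):
    """Return a One-Hot list for an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def categorical_crossentropy(y_true_one_hot, probs):
    """Cross-entropy against a One-Hot target: -sum(t * log(p))."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, probs) if t > 0)


def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy against an integer target: -log(p[label])."""
    return -math.log(probs[label])


probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for one sample
label = 1                # integer-encoded class of that sample

loss_from_one_hot = categorical_crossentropy(one_hot(label, 3), probs)
loss_from_integer = sparse_categorical_crossentropy(label, probs)
print(loss_from_one_hot, loss_from_integer)
```

Both values are the negative log of the probability assigned to the true class, so the choice of encoding by itself does not change what the model is optimized for.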

Random seeds and model reproducibility

As an automated machine learning (AutoML) tool, Autokeras performs a large number of random operations when looking for the best model architecture and hyperparameters, such as:

Hyperparameter search space exploration: Different random initializations may cause search algorithms to explore different combinations of hyperparameters.
Model weight initialization: The initial weights of a neural network are usually random.
Data shuffling: Training data is usually randomly shuffled before each epoch starts.
Dropout layer: The Dropout operation itself is random.

This randomness can produce different results on every run, especially when max_trials is small. If it leads the hyperparameter search to a suboptimal architecture, or the model starts from an unfavorable set of weights, performance can drop sharply even when the input data and label handling are correct. That is the most likely explanation for the case described here, where One-Hot input gave low accuracy (0.40) and integer labels gave high accuracy (0.97): the two runs followed different hyperparameter search paths and ended with different final models.
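
The effect can be sketched with a toy random search in plain Python. This is not how Autokeras searches internally; the search space, the scoring rule, and the name toy_search are all assumptions made up for illustration. The point is only that a fixed seed reproduces the search path, while different seeds with few trials can pick different winners.

```python
import random


def toy_search(seed, max_trials):
    """Simulate a hyperparameter search: sample configs, keep the best score.

    The scoring rule is a made-up stand-in for validation accuracy.
    """
    rng = random.Random(seed)
    best_score, best_config = -1.0, None
    for _ in range(max_trials):
        config = {
            "units": rng.choice([16, 32, 64, 128]),
            "dropout": rng.choice([0.0, 0.25, 0.5]),
        }
        # Hypothetical objective: depends on the config plus noise that
        # stands in for weight initialization and data shuffling.
        noise = rng.uniform(-0.1, 0.1)
        score = 0.5 + 0.003 * config["units"] - 0.4 * config["dropout"] + noise
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


# Same seed: identical search path and identical result.
run_a = toy_search(seed=42, max_trials=3)
run_b = toy_search(seed=42, max_trials=3)

# Different seeds: the few sampled configurations differ, and so may the winner.
results = {seed: toy_search(seed, max_trials=3) for seed in range(5)}
for seed, (score, config) in results.items():
    print(seed, round(score, 3), config)
```

With only three trials per run, each seed sees a small and different slice of the search space, which is the situation the paragraph above describes.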

Strategies to ensure the reproducibility of Autokeras models

To address performance fluctuations caused by randomness and make results reproducible, set the random seeds explicitly. Setting only the seed parameter in the StructuredDataClassifier constructor may not control every source of randomness. A more thorough approach is to also set the global random seeds with the utility Keras provides.
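
As a rough picture of what global seeding does, keras.utils.set_random_seed() seeds Python's random module, NumPy, and the backend framework in one call. The sketch below reproduces the idea for the first two only; seed_everything and draw_samples are hypothetical helper names, and the NumPy step is skipped when NumPy is not installed.

```python
import random


def seed_everything(seed):
    """Seed the global random generators available in this environment."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
    except ImportError:
        np = None  # NumPy is optional in this sketch
    if np is not None:
        np.random.seed(seed)  # NumPy's legacy global generator


def draw_samples(n=3):
    """Draw n numbers from the global Python generator."""
    return [random.random() for _ in range(n)]


seed_everything(42)
first = draw_samples()

seed_everything(42)
second = draw_samples()  # reseeding restarts the stream, so this repeats `first`

unseeded = draw_samples()  # continues the stream, so it differs from `first`
print(first == second, first == unseeded)
```

Any code that draws from these global generators after the seeding call follows the same sequence on every run, which is why the call belongs at the very top of the script.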

Here are the recommended steps to ensure the reproducibility of Autokeras models:

Set the random seeds globally: at the beginning of the script, call keras.utils.set_random_seed() to seed Python's random module, NumPy, and TensorFlow in one call.

import os

import numpy as np
import tensorflow as tf
import autokeras as ak
import keras

# Set random seeds to ensure reproducibility
random_seed = 42  # any fixed integer works
keras.utils.set_random_seed(random_seed)

# Optional, GPU only: enable memory growth if a GPU is present
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)

Specify seed and overwrite mode when initializing the Autokeras classifier: in addition to setting the seed parameter, set overwrite=True when initializing StructuredDataClassifier. overwrite=True makes every run search from scratch instead of loading the results of a previous run, which avoids interference from stale results.
```
# Initialize the structured data classifier.
# overwrite=True restarts the search on every run instead of loading previous results.
# seed makes the randomness inside Autokeras controllable.
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=10, seed=random_seed)
```
Increase max_trials to stabilize the result (optional but recommended): the max_trials parameter determines how many different model architectures and hyperparameter combinations Autokeras tries. When max_trials is small (such as the 10 used in this example; the Autokeras default is 100), the hyperparameter search may not be thorough enough, which makes the result very sensitive to the random seed. Increasing max_trials (for example, to 50 or 100) makes the search more comprehensive, raising the probability of finding a stable, high-performing model and reducing the fluctuation caused by different random seeds.
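
The effect of max_trials on seed sensitivity can be illustrated with a small simulation in plain Python. It is only a sketch under strong assumptions: each trial's score is a uniform random number standing in for validation accuracy, and best_after is a hypothetical helper, not an Autokeras function.

```python
import random


def best_after(seed, max_trials):
    """Best score among the first `max_trials` draws of a seeded stream.

    Each draw is a made-up stand-in for one trial's validation accuracy.
    """
    rng = random.Random(seed)
    return max(rng.random() for _ in range(max_trials))


seeds = range(20)
for max_trials in (2, 10, 100):
    best_scores = [best_after(seed, max_trials) for seed in seeds]
    spread = max(best_scores) - min(best_scores)
    print(f"max_trials={max_trials:>3}  worst={min(best_scores):.3f}  "
          f"best={max(best_scores):.3f}  spread={spread:.3f}")
```

For a given seed, more trials can only keep or improve the best score found so far, so the worst-case seed tends to catch up with the best one and the spread across seeds typically narrows as max_trials grows. Real searches are costlier and less uniform than this toy model, so treat the numbers as qualitative only.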

Recommended label encoding practice

Although Autokeras can handle One-Hot encoded labels internally, converting them to integer labels before passing data to StructuredDataClassifier keeps the code clearer and matches the convention of most classification APIs. It also simplifies the output_signature definition for tf.data.Dataset.from_generator and makes the labels easier to interpret.
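
The conversion itself is small. Here is a minimal sketch in plain Python that also rejects rows that are not valid One-Hot vectors; one_hot_to_int is a hypothetical helper name and the sample rows are made up. On NumPy arrays, np.argmax(..., axis=1) performs the same conversion, but without the validity check.

```python
def one_hot_to_int(rows):
    """Convert One-Hot rows to integer class labels, rejecting malformed rows."""
    labels = []
    for index, row in enumerate(rows):
        # A valid One-Hot row holds exactly one 1 and zeros everywhere else.
        if sorted(row) != [0] * (len(row) - 1) + [1]:
            raise ValueError(f"row {index} is not One-Hot: {row}")
        labels.append(row.index(1))
    return labels


y_one_hot = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]
y_int = one_hot_to_int(y_one_hot)
print(y_int)  # [0, 2, 1]
```

Validating the rows first is a design choice for this sketch: an argmax over a malformed row (for example, one with two ones) would silently return the first maximum instead of flagging the bad label.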

Here is a sample code snippet that converts the labels to integers:

import os

import numpy as np
import tensorflow as tf
import autokeras as ak
import keras

# Set the random seed
random_seed = 42
keras.utils.set_random_seed(random_seed)

N_FEATURES = 8
N_CLASSES = 3
BATCH_SIZE = 100


def get_data_generator(folder_path, batch_size, n_features):
    """
    Return a generator function that yields batches from the .npy files
    in the specified folder.
    Features have shape (batch_size, n_features).
    Labels have shape (batch_size,) and are integers.
    """
    def data_generator():
        # Sort the file names: os.listdir returns them in arbitrary order,
        # which could change the batch order between runs.
        npy_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.npy'))

        for npy_file in npy_files:
            data = np.load(os.path.join(folder_path, npy_file))
            x = data[:, :n_features]
            y_ohe = data[:, n_features:]
            y_int = np.argmax(y_ohe, axis=1)  # convert One-Hot encoding to integer labels
            for i in range(0, len(x), batch_size):
                yield x[i:i + batch_size], y_int[i:i + batch_size]

    return data_generator


train_data_folder = '/home/my_user_name/original_data/train_data_npy'
validation_data_folder = '/home/my_user_name/original_data/valid_data_npy'

# Create the training dataset with 1D integer labels
train_dataset = tf.data.Dataset.from_generator(
    get_data_generator(train_data_folder, BATCH_SIZE, N_FEATURES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # labels are now 1D integers
    ),
)

# Create the validation dataset with 1D integer labels
validation_dataset = tf.data.Dataset.from_generator(
    get_data_generator(validation_data_folder, BATCH_SIZE, N_FEATURES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # labels are now 1D integers
    ),
)

# Initialize the classifier with the random seed and overwrite mode
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=10, seed=random_seed)

# Train the classifier
clf.fit(train_dataset, epochs=100)

# Evaluate the model
print("Model evaluation results:", clf.evaluate(validation_dataset))

# Export and save the model (optional)
model = clf.export_model()
model.save("heca_v2_model_reproducible", save_format='tf')

Summary

When an Autokeras model shows large performance differences across runs even though the label encoding looks reasonable, the root cause is often that random seeds are not properly managed. StructuredDataClassifier accepts integer labels and performs the One-Hot conversion internally, so providing One-Hot encoded labels is usually not the direct cause of poor performance. Setting the random seeds globally at the start of the script, passing seed when initializing the classifier, and setting overwrite=True together improve the reproducibility of model training. Increasing max_trials where appropriate, and converting One-Hot encoded labels to integers before passing them to the model, are further good practices for a stable and trustworthy AutoML workflow.
