


The impact of label encoding and random seeds on model performance, and reproducibility strategies in AutoKeras
Label handling in AutoKeras
In classification tasks, label encoding is a key preprocessing step. Common schemes are One-Hot encoding and integer encoding. AutoKeras's StructuredDataClassifier is designed for classification and is usually given class labels as integers. If you pass One-Hot encoded labels instead, AutoKeras still treats the task as classification and converts the labels as needed in its internal pipeline.
In fact, after receiving integer labels, AutoKeras converts them to One-Hot encoding so that they are compatible with the loss functions commonly used for multi-class classification, such as CategoricalCrossentropy. You can check whether a OneHotEncoder object is present in the preprocessor chain by inspecting clf.outputs[0].in_blocks[0].get_hyper_preprocessors(), and confirm which loss function is used through clf.outputs[0].in_blocks[0].loss. This means that whether you provide the original One-Hot labels or labels converted to integers, the internal label representation and the loss function used for final training are usually the same. Therefore, when a huge performance difference is observed between the two (for example, accuracy of 0.40 versus 0.97), the cause is usually not the "correctness" of the label encoding but some other factor.
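The equivalence described above can be illustrated without AutoKeras. The following is a minimal NumPy sketch, not AutoKeras's internal code: the helper names (to_one_hot, to_int_labels and the two loss functions) and the probability values are made up for illustration. It only shows that integer labels and One-Hot labels carry the same information and produce the same cross-entropy.

```python
import numpy as np


def to_one_hot(int_labels, n_classes):
    """Convert integer labels of shape (n,) to One-Hot rows of shape (n, n_classes)."""
    return np.eye(n_classes, dtype=np.float32)[int_labels]


def to_int_labels(one_hot):
    """Recover integer labels from One-Hot rows."""
    return np.argmax(one_hot, axis=1)


def categorical_crossentropy(one_hot, probs):
    """Mean cross-entropy between One-Hot targets and predicted probabilities."""
    return float(-np.mean(np.sum(one_hot * np.log(probs), axis=1)))


def sparse_categorical_crossentropy(int_labels, probs):
    """Mean cross-entropy computed directly from integer targets."""
    picked = probs[np.arange(len(int_labels)), int_labels]
    return float(-np.mean(np.log(picked)))


labels = np.array([0, 2, 1, 2])
one_hot = to_one_hot(labels, n_classes=3)

# Made-up predicted probabilities, one row per sample (each row sums to 1).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

round_trip = to_int_labels(one_hot)
loss_from_one_hot = categorical_crossentropy(one_hot, probs)
loss_from_ints = sparse_categorical_crossentropy(labels, probs)

print(round_trip)
print(np.isclose(loss_from_one_hot, loss_from_ints))
```

Because the two representations are interchangeable in this sense, a large accuracy gap between them points to something other than the encoding.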
Random seeds and model reproducibility
As an automated machine learning (AutoML) tool, AutoKeras performs many random operations while searching for the best model architecture and hyperparameters, such as:
- Hyperparameter search space exploration: Different random states may cause the search algorithm to explore different combinations of hyperparameters.
- Model weight initialization: The initial weights of a neural network are usually drawn at random.
- Data shuffling: Training data is usually shuffled randomly before each epoch starts.
- Dropout layers: Dropout randomly zeroes a subset of activations at each training step.
These sources of randomness can produce different results each time the code is run, especially when the max_trials parameter is small. If randomness leads the search to settle on a suboptimal architecture, or a model starts from an unfavorable set of initial weights, performance can drop sharply even though the input data and label processing are correct. This is the root cause of the observation in this case, where One-Hot input gave low accuracy (0.40) and integer input gave high accuracy (0.97): different random seeds led to different hyperparameter search paths and different final models.
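The effect of a seed on such operations can be seen in a small standard-library sketch. It is a toy stand-in, not AutoKeras code: simulate_run is a hypothetical function that imitates only a data shuffle and a weight initialization.

```python
import random


def simulate_run(seed):
    """Imitate two random steps of a training run: a data shuffle and a weight init."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    order = list(range(10))
    rng.shuffle(order)  # stands in for the per-epoch data shuffle
    weights = [rng.gauss(0.0, 1.0) for _ in range(4)]  # stands in for weight init
    return order, weights


run_a = simulate_run(seed=42)
run_b = simulate_run(seed=42)
run_c = simulate_run(seed=7)

print(run_a == run_b)  # same seed: identical shuffle order and weights
print(run_a == run_c)  # different seed: the run diverges from the first step
```

In a real search, every trial builds on such steps, so an unseeded run can follow a different path from the very first trial.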
Strategies to ensure the reproducibility of Autokeras models
To remove the performance fluctuations caused by randomness and make experimental results reproducible, random seeds must be set explicitly. Setting only the seed parameter in the StructuredDataClassifier constructor may not be enough to control every source of randomness. A more comprehensive approach is to also set the global random seeds with the utility provided by Keras.
Here are the recommended steps to ensure the reproducibility of Autokeras models:
- Set random seeds globally: At the beginning of the script, call keras.utils.set_random_seed() to seed the random operations of Python, NumPy, and TensorFlow in one step.
    import os

    import numpy as np
    import tensorflow as tf
    import autokeras as ak
    import keras

    # Set random seeds to ensure reproducibility
    random_seed = 42  # Any fixed integer works
    keras.utils.set_random_seed(random_seed)

    # Optional, GPU only: allocate GPU memory on demand.
    # Looping over the device list avoids an IndexError on machines without a GPU.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)
- Specify the seed and overwrite mode when initializing the AutoKeras classifier: In addition to setting the seed parameter, set overwrite=True when initializing StructuredDataClassifier. overwrite=True makes every run search from scratch instead of loading the results of a previous run, which avoids interference from earlier trials.
    # Initialize the structured data classifier.
    # overwrite=True restarts the search on every run instead of loading previous results.
    # seed further ensures that the randomness inside AutoKeras is controllable.
    clf = ak.StructuredDataClassifier(overwrite=True, max_trials=10, seed=random_seed)
- Increase max_trials to stabilize the result (optional but recommended): The max_trials parameter determines how many different model architectures and hyperparameter combinations AutoKeras tries. When max_trials is small (such as the 10 used in this example; the AutoKeras default is 100), the hyperparameter search may be too shallow, which makes the result very sensitive to the random seed. Increasing max_trials (for example, to 50 or 100) makes the search more thorough, which raises the probability of finding a stable, high-performing model and reduces the fluctuation between seeds.
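Why a larger max_trials tends to reduce seed sensitivity can be sketched with a toy random search. This is a simplified model under stated assumptions, not how AutoKeras searches: each trial's score is drawn independently and uniformly from [0.40, 0.97], and best_of_random_search and spread_across_seeds are hypothetical helper names.

```python
import random
import statistics


def best_of_random_search(seed, max_trials):
    """Toy random search: each trial gets a random score; keep the best one."""
    rng = random.Random(seed)
    # Assumption for illustration only: trial scores are uniform in [0.40, 0.97].
    return max(rng.uniform(0.40, 0.97) for _ in range(max_trials))


def spread_across_seeds(max_trials, n_seeds=200):
    """Standard deviation of the best score over many different seeds."""
    best_scores = [best_of_random_search(seed, max_trials) for seed in range(n_seeds)]
    return statistics.stdev(best_scores)


spread_small = spread_across_seeds(max_trials=3)
spread_large = spread_across_seeds(max_trials=50)

print(f"max_trials=3:  std of best score = {spread_small:.3f}")
print(f"max_trials=50: std of best score = {spread_large:.3f}")
```

In this toy model the best score varies much less across seeds when more trials are run. A real search is neither uniform nor independent across trials, so the numbers themselves carry no meaning, only the trend.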
Recommended label encoding practice
Although AutoKeras can handle One-Hot encoding internally, it is recommended to convert One-Hot encoded labels to integer labels before passing data to StructuredDataClassifier. Doing so keeps the code clear and consistent with the conventions of most classification APIs. It also simplifies the output_signature definition of tf.data.Dataset.from_generator and makes the meaning of the labels more intuitive.
Here is a sample code snippet that converts the labels to integers:
    import os

    import numpy as np
    import tensorflow as tf
    import autokeras as ak
    import keras

    # Set random seed
    random_seed = 42
    keras.utils.set_random_seed(random_seed)

    N_FEATURES = 8
    N_CLASSES = 3
    BATCH_SIZE = 100


    def get_data_generator(folder_path, batch_size, n_features):
        """
        Return a generator function that yields batches from the .npy files
        in the specified folder.
        Features have shape (batch_size, n_features).
        Labels have shape (batch_size,) and are integers.
        The last batch of each file may be smaller than batch_size.
        """
        def data_generator():
            # Sort the file names: os.listdir() returns them in arbitrary order,
            # which would otherwise be another source of run-to-run variation.
            npy_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.npy'))
            for npy_file in npy_files:
                data = np.load(os.path.join(folder_path, npy_file))
                x = data[:, :n_features]
                y_ohe = data[:, n_features:]
                y_int = np.argmax(y_ohe, axis=1)  # Convert One-Hot encoding to integer labels
                for i in range(0, len(x), batch_size):
                    yield x[i:i + batch_size], y_int[i:i + batch_size]
        return data_generator


    train_data_folder = '/home/my_user_name/original_data/train_data_npy'
    validation_data_folder = '/home/my_user_name/original_data/valid_data_npy'

    # Create the training dataset with the labels as 1D integers
    train_dataset = tf.data.Dataset.from_generator(
        get_data_generator(train_data_folder, BATCH_SIZE, N_FEATURES),
        output_signature=(
            tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
            tf.TensorSpec(shape=(None,), dtype=tf.int32),  # Labels are now 1D integers
        ),
    )

    # Create the validation dataset with the labels as 1D integers
    validation_dataset = tf.data.Dataset.from_generator(
        get_data_generator(validation_data_folder, BATCH_SIZE, N_FEATURES),
        output_signature=(
            tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
            tf.TensorSpec(shape=(None,), dtype=tf.int32),  # Labels are now 1D integers
        ),
    )

    # Initialize the classifier and set the random seed and overwrite mode
    clf = ak.StructuredDataClassifier(overwrite=True, max_trials=10, seed=random_seed)

    # Train the classifier
    clf.fit(train_dataset, epochs=100)

    # Evaluate the model
    print("Model evaluation results:", clf.evaluate(validation_dataset))

    # Export and save the model (optional)
    model = clf.export_model()
    model.save("heca_v2_model_reproducible", save_format='tf')
Summary
When an AutoKeras model shows large performance differences between runs even though the label encoding looks reasonable, the root cause is often that random seeds are not properly managed. AutoKeras's StructuredDataClassifier processes integer labels internally and performs the One-Hot conversion itself, so providing One-Hot encoded labels is usually not the direct reason for poor performance. Setting random seeds globally at the beginning of the script, passing a seed when initializing the classifier, and setting overwrite=True together make model training much more reproducible. In addition, increasing max_trials appropriately and converting One-Hot encoded labels to integers before feeding them to the model are good practices for building a stable and trustworthy AutoML workflow.