Home>Article>Backend Development> Magic Mushrooms: exploring and treating null data with Mage

Magic Mushrooms: exploring and treating null data with Mage

王林
王林 Original
2024-08-18 06:02:02 576browse

Mage is a powerful tool for ETL tasks, with features that enable data exploration and mining, quick visualizations through graph templates and several other features that transform your work with data into something magical.

When processing data, during an ETL process it is common to find missing data that can generate problems in the future, depending on the activity we are going to carry out with the dataset, null data can be quite disruptive.

To identify the absence of data in our dataset, we can use Python and the pandas library to check the data that presents null values, in addition we can create graphs that show even more clearly the impact of these null values in our dataset.

Our pipeline consists of 4 steps: starting with data loading, two processing steps and data export.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

Data Loader

In this article we will use the dataset: Binary Prediction of Poisonous Mushrooms which is available on Kaggle as part of a competition. Let's use the training dataset available on the website.

We will create a Data Loader step using python to be able to load the data we are going to use. Before this step, I created a table in the Postgres database, which I have locally on my machine, to be able to load the data. As the data is in Postgres, we will use the already defined Postgres load template within Mage.

from mage_ai.settings.repo import get_repo_path from mage_ai.io.config import ConfigFileLoader from mage_ai.io.postgres import Postgres from os import path if 'data_loader' not in globals(): from mage_ai.data_preparation.decorators import data_loader if 'test' not in globals(): from mage_ai.data_preparation.decorators import test @data_loader def load_data_from_postgres(*args, **kwargs): """ Template for loading data from a PostgreSQL database. Specify your configuration settings in 'io_config.yaml'. Docs: https://docs.mage.ai/design/data-loading#postgresql """ query = 'SELECT * FROM mushroom' # Specify your SQL query here config_path = path.join(get_repo_path(), 'io_config.yaml') config_profile = 'default' with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader: return loader.load(query) @test def test_output(output, *args) -> None: """ Template code for testing the output of the block. """ assert output is not None, 'The output is undefined'
e

Within the functionload_data_from_postgres()we will define the query that we will use to load the table in the database. In my case, I configured the bank information in the fileio_config.yamlwhere it is defined as the default configuration, so we only need to pass the default name to the variableconfig_profile.

After executing the block, we will use the Add chart feature, which will provide information about our data through already defined templates. Just click on the icon next to the play button, marked in the image with the yellow line.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

We will select two options to explore our dataset further, the summay_overview and feature_profiles options. Through summary_overview, we obtain information about the number of columns and rows in the dataset. We can also view the total number of columns by type, for example the total number of categorical, numeric and Boolean columns. Feature_profiles, on the other hand, presents more descriptive information about the data, such as: type, minimum value, maximum value, among other information, and we can even visualize the missing values, which are the focus of our treatment.

To be able to focus more on missing data, let's use the template: % of missing values, a bar graph with the percentage of data that is missing, in each of the columns.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

The graph presents 4 columns where missing values correspond to more than 80% of its content, and other columns that present missing values but in a smaller quantity, this information now allows us to seek different strategies to deal with this null data.

Transformer Drop Columns

For columns that have more than 80% of null values, the strategy we will follow will be to perform a drop columns in the dataframe, selecting the columns that we are going to exclude from the dataframe. Using theTRANSFORMERBlock in the Python language, we will select the optionColum removal.

from mage_ai.data_cleaner.transformer_actions.base import BaseAction from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action from pandas import DataFrame if 'transformer' not in globals(): from mage_ai.data_preparation.decorators import transformer if 'test' not in globals(): from mage_ai.data_preparation.decorators import test @transformer def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame: """ Execute Transformer Action: ActionType.REMOVE Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns """ action = build_transformer_action( df, action_type=ActionType.REMOVE, arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'], axis=Axis.COLUMN, ) return BaseAction(action).execute(df) @test def test_output(output, *args) -> None: """ Template code for testing the output of the block. """ assert output is not None, 'The output is undefined'
e

Within the functionexecute_transformer_action()we will insert a list with the name of the columns that we want to exclude from the dataset, in the arguments variable, after this step, just execute the block.

Transformer Fill in Missing Values

Now for the columns that have less than 80% of null values, we will use the strategyFill in Missing Values, as in some cases, despite having missing data, replacing these with values such as mean, or mode, can be covers to meet the need for data without causing many changes to the dataset, depending on its final objective.

Existem algumas tarefas, como a de classificação, onde a substituição dos dados faltantes por um valor que seja relevante (moda, média, mediana) para o dataset, possa contribuir com o algoritmo de classificação, que poderia chegar a outras conclusões caso o dados fossem apagados como na outra estratégia de utilizamos.

Para tomar uma decisão com relação a qual medida vamos utilizar, vamos recorrer novamente a funcionalidadeAdd chartdo Mage. Usando o templateMost frequent valuespodemos visualizar a moda e a frequência desse valor em cada uma das colunas.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

Seguindos passos semelhantes aos anteriores, vamos usar o tranformerFill in missing values, para realizar a tarefa de subtiruir os dados faltantes usando a moda de cada uma das colunas: steam_surface, gill_spacing, cap_surface, gill_attachment, ring_type.

from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy from mage_ai.data_cleaner.transformer_actions.base import BaseAction from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action from pandas import DataFrame if 'transformer' not in globals(): from mage_ai.data_preparation.decorators import transformer if 'test' not in globals(): from mage_ai.data_preparation.decorators import test @transformer def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame: """ Execute Transformer Action: ActionType.IMPUTE Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values """ action = build_transformer_action( df, action_type=ActionType.IMPUTE, arguments=df.columns, # Specify columns to impute axis=Axis.COLUMN, options={'strategy': ImputationStrategy.MODE}, # Specify imputation strategy ) return BaseAction(action).execute(df) @test def test_output(output, *args) -> None: """ Template code for testing the output of the block. """ assert output is not None, 'The output is undefined'

Na funçãoexecute_transformer_action(), definimos a estratégia para a substituição dos dados num dicionário do Python. Para mais opções de substituição, basta acessar a documentação do transformer: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.

Data Exporter

Ao realizar todas as transformações, vamos salvar nosso dataset agora tratado, na mesma base do Postgres mas agora com um nome diferente para podermos diferenciar. Usando o blocoData Exportere selecionando o Postgres, vamos definir o shema e a tabela onde queremos salvar, lembrando que as configurações do banco são salvas previamente no arquivoio_config.yaml.

from mage_ai.settings.repo import get_repo_path from mage_ai.io.config import ConfigFileLoader from mage_ai.io.postgres import Postgres from pandas import DataFrame from os import path if 'data_exporter' not in globals(): from mage_ai.data_preparation.decorators import data_exporter @data_exporter def export_data_to_postgres(df: DataFrame, **kwargs) -> None: """ Template for exporting data to a PostgreSQL database. Specify your configuration settings in 'io_config.yaml'. Docs: https://docs.mage.ai/design/data-loading#postgresql """ schema_name = 'public' # Specify the name of the schema to export data to table_name = 'mushroom_clean' # Specify the name of the table to export data to config_path = path.join(get_repo_path(), 'io_config.yaml') config_profile = 'default' with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader: loader.export( df, schema_name, table_name, index=False, # Specifies whether to include index in exported table if_exists='replace', #Specify resolution policy if table name already exists )

Obrigado e até a próxima ?

repo -> https://github.com/DeadPunnk/Mushrooms/tree/main

The above is the detailed content of Magic Mushrooms: exploring and treating null data with Mage. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn