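Mage is a powerful tool for ETL tasks. It offers features for data exploration and mining, quick visualization through chart templates, and several other features that turn your data work into something magical.
During an ETL process it is common to find missing data, which can cause problems later on. Depending on what we intend to do with the dataset, null values can be quite disruptive.
To find out whether a dataset has missing data, we can use Python and the pandas library to check for null values. We can also create charts that show more clearly how these null values affect our dataset.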
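As a minimal sketch of the pandas check mentioned above, the snippet below counts null values per column and expresses them as a percentage. The toy DataFrame and its values are hypothetical stand-ins for the real dataset, not data taken from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "cap_diameter": [8.8, 4.5, np.nan, 3.9],
    "cap_surface": ["s", None, None, "h"],
    "season": ["a", "w", "a", "u"],
})

# Number of null values in each column.
null_counts = df.isna().sum()

# Share of null values in each column, as a percentage.
null_pct = df.isna().mean() * 100

print(null_counts)
print(null_pct.round(1))
```

`isna()` returns a boolean frame, so summing it counts the nulls and averaging it gives the null fraction per column.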
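Our pipeline consists of 4 steps: a data loading step, two transformation steps, and a data export step.
In this article we will use the dataset Binary Prediction of Poisonous Mushrooms, which is available on Kaggle as part of a competition. We will use the training dataset provided on the site.
Let's create a data loader step in Python to load the data we are going to use. Before this step, I created a table in a Postgres database on my local machine so that the data could be loaded from it. Since the data lives in Postgres, we will use the Postgres loading template that Mage already provides.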
    from mage_ai.settings.repo import get_repo_path
    from mage_ai.io.config import ConfigFileLoader
    from mage_ai.io.postgres import Postgres
    from os import path

    if 'data_loader' not in globals():
        from mage_ai.data_preparation.decorators import data_loader
    if 'test' not in globals():
        from mage_ai.data_preparation.decorators import test


    @data_loader
    def load_data_from_postgres(*args, **kwargs):
        """
        Template for loading data from a PostgreSQL database.
        Specify your configuration settings in 'io_config.yaml'.

        Docs: https://docs.mage.ai/design/data-loading#postgresql
        """
        query = 'SELECT * FROM mushroom'  # Specify your SQL query here
        config_path = path.join(get_repo_path(), 'io_config.yaml')
        config_profile = 'default'

        with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
            return loader.load(query)


    @test
    def test_output(output, *args) -> None:
        """
        Template code for testing the output of the block.
        """
        assert output is not None, 'The output is undefined'
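In the function load_data_from_postgres() we define the query used to load the table from the database. In my case, the database connection settings are stored in the file io_config.yaml under the default profile, so we only need to pass the name default to the variable config_profile.
After running the block, we will use the Add chart feature, which provides information about our data through predefined templates. Just click the icon next to the play button (marked with a yellow line in the image).
We will select two options to explore our dataset further: summary_overview and feature_profiles. With summary_overview we get the number of columns and rows in the dataset, and we can also see the total number of columns by type, for example the totals of categorical, numeric, and boolean columns. feature_profiles, on the other hand, presents more descriptive information about the data, such as type, minimum value, and maximum value. It even lets us visualize the missing values, which are the focus of our treatment.
To focus more closely on the missing data, let's use the template for the percentage of missing values, a bar chart that shows the percentage of missing data in each column.
The chart shows 4 columns in which missing values account for more than 80% of the content, while other columns have missing values in smaller amounts. This information lets us choose different strategies for handling the null data.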
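The same reading of the chart can be approximated in pandas. The sketch below is not part of the Mage pipeline: the helper `columns_above_missing_threshold` and the toy data are hypothetical, and the 0.8 threshold simply mirrors the 80% figure mentioned above.

```python
import pandas as pd


def columns_above_missing_threshold(df, threshold=0.8):
    """Return the names of columns whose fraction of nulls exceeds `threshold`."""
    missing_fraction = df.isna().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


# Hypothetical toy data: veil_type is missing in 5 of 6 rows (about 83%).
df = pd.DataFrame({
    "veil_type": [None, None, None, None, None, "u"],
    "ring_type": ["f", None, "f", "z", "f", "f"],
    "season": ["a", "w", "a", "u", "s", "a"],
})

sparse_columns = columns_above_missing_threshold(df)
print(sparse_columns)
```

Columns returned by a helper like this are candidates for removal, while the remaining columns with fewer nulls are candidates for imputation.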
Transformer: Drop Columns
Using a TRANSFORMER block, we will select the Column removal option.
    from mage_ai.data_cleaner.transformer_actions.base import BaseAction
    from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
    from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
    from pandas import DataFrame

    if 'transformer' not in globals():
        from mage_ai.data_preparation.decorators import transformer
    if 'test' not in globals():
        from mage_ai.data_preparation.decorators import test


    @transformer
    def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
        """
        Execute Transformer Action: ActionType.REMOVE

        Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns
        """
        action = build_transformer_action(
            df,
            action_type=ActionType.REMOVE,
            arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'],
            axis=Axis.COLUMN,
        )

        return BaseAction(action).execute(df)


    @test
    def test_output(output, *args) -> None:
        """
        Template code for testing the output of the block.
        """
        assert output is not None, 'The output is undefined'
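In the function execute_transformer_action(), we pass to the arguments variable a list with the names of the columns we want to exclude from the dataset. After that, just run the block.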
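For comparison, here is a rough pandas-only sketch of the same column removal. It is an assumption about equivalent behavior, not Mage's internal implementation, and the toy DataFrame is hypothetical; only the four column names come from the block above.

```python
import pandas as pd

# Column names taken from the transformer block above.
columns_to_drop = ["veil_type", "spore_print_color", "stem_root", "veil_color"]

# Hypothetical toy data with one column we want to keep.
df = pd.DataFrame({
    "cap_shape": ["x", "f"],
    "veil_type": [None, "u"],
    "spore_print_color": [None, None],
    "stem_root": [None, "b"],
    "veil_color": [None, "w"],
})

# drop() returns a new DataFrame; errors="ignore" skips names that are absent.
reduced = df.drop(columns=columns_to_drop, errors="ignore")
print(reduced.columns.tolist())
```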
Transformer: Fill in missing values
In some cases, even though data is missing, replacing it with a value such as the mean or the mode can meet the data needs without changing the dataset too much, depending on your end goal.
There are tasks, such as classification, where replacing missing data with a value that is relevant to the dataset (mode, mean, median) can help the classification algorithm, which might reach different conclusions if the data were deleted as in the other strategy we used.
To decide which measure to use, we will turn again to Mage's Add chart feature. Using the Most frequent values template, we can see the mode and the frequency of that value in each column.
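Outside Mage, a similar summary can be sketched with pandas. The helper `most_frequent_values` and the toy data below are hypothetical and only illustrate the idea of listing each column's mode and how often it occurs.

```python
import pandas as pd


def most_frequent_values(df):
    """Return each column's most frequent non-null value and its count."""
    rows = []
    for column in df.columns:
        counts = df[column].value_counts()  # nulls are excluded by default
        if counts.empty:
            continue  # column contains only nulls
        rows.append({
            "column": column,
            "mode": counts.index[0],
            "count": int(counts.iloc[0]),
        })
    return pd.DataFrame(rows, columns=["column", "mode", "count"]).set_index("column")


# Hypothetical toy data.
df = pd.DataFrame({
    "cap_surface": ["s", "s", None, "h"],
    "ring_type": ["f", "f", "f", None],
})

summary = most_frequent_values(df)
print(summary)
```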
Following steps similar to the previous ones, we will use the Fill in missing values transformer to replace the missing data with the mode of each of these columns: stem_surface, gill_spacing, cap_surface, gill_attachment, ring_type.
    from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy
    from mage_ai.data_cleaner.transformer_actions.base import BaseAction
    from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
    from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
    from pandas import DataFrame

    if 'transformer' not in globals():
        from mage_ai.data_preparation.decorators import transformer
    if 'test' not in globals():
        from mage_ai.data_preparation.decorators import test


    @transformer
    def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
        """
        Execute Transformer Action: ActionType.IMPUTE

        Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values
        """
        action = build_transformer_action(
            df,
            action_type=ActionType.IMPUTE,
            arguments=df.columns,  # Specify columns to impute
            axis=Axis.COLUMN,
            options={'strategy': ImputationStrategy.MODE},  # Specify imputation strategy
        )

        return BaseAction(action).execute(df)


    @test
    def test_output(output, *args) -> None:
        """
        Template code for testing the output of the block.
        """
        assert output is not None, 'The output is undefined'
In the function execute_transformer_action(), we define the imputation strategy in a Python dictionary. For more imputation options, see the transformer documentation: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.
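To make the mode strategy concrete, the sketch below shows one way to do mode imputation directly in pandas. It is an illustration under assumptions, not what Mage runs internally; the helper `fill_with_mode` and the toy values are hypothetical.

```python
import pandas as pd


def fill_with_mode(df, columns):
    """Return a copy of `df` with nulls in `columns` replaced by each column's mode."""
    filled = df.copy()
    for column in columns:
        modes = filled[column].mode()  # nulls are ignored by default
        if not modes.empty:
            # If several values tie, mode() returns them sorted; take the first.
            filled[column] = filled[column].fillna(modes.iloc[0])
    return filled


# Hypothetical toy data.
df = pd.DataFrame({
    "cap_surface": ["s", None, "s", "y"],
    "gill_spacing": ["c", "c", None, None],
})

cleaned = fill_with_mode(df, ["cap_surface", "gill_spacing"])
print(cleaned)
```

Working on a copy keeps the original DataFrame intact, which makes it easy to compare the data before and after imputation.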
With all the transformations done, we will save the treated dataset in the same Postgres database, under a different name so we can tell the two tables apart. Using the Data Exporter block and selecting Postgres, we define the schema and the table where we want to save the data. Remember that the database settings were saved beforehand in the file io_config.yaml.
    from mage_ai.settings.repo import get_repo_path
    from mage_ai.io.config import ConfigFileLoader
    from mage_ai.io.postgres import Postgres
    from pandas import DataFrame
    from os import path

    if 'data_exporter' not in globals():
        from mage_ai.data_preparation.decorators import data_exporter


    @data_exporter
    def export_data_to_postgres(df: DataFrame, **kwargs) -> None:
        """
        Template for exporting data to a PostgreSQL database.
        Specify your configuration settings in 'io_config.yaml'.

        Docs: https://docs.mage.ai/design/data-loading#postgresql
        """
        schema_name = 'public'  # Specify the name of the schema to export data to
        table_name = 'mushroom_clean'  # Specify the name of the table to export data to
        config_path = path.join(get_repo_path(), 'io_config.yaml')
        config_profile = 'default'

        with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
            loader.export(
                df,
                schema_name,
                table_name,
                index=False,  # Specifies whether to include index in exported table
                if_exists='replace',  # Specify resolution policy if table name already exists
            )
repo -> https://github.com/DeadPunnk/Mushrooms/tree/main