Efficiently map Pandas DataFrame multi-column data based on key-value conditions

This article shows how to use Pandas and NumPy to efficiently and conditionally map multiple columns of a DataFrame based on the value of a "key" column. Because applying `numpy.select` column by column is repetitive and inefficient, the tutorial demonstrates a vectorized alternative: constructing a Boolean mask and applying it with the `DataFrame.where()` method, replacing values that do not meet the conditions with a specified marker (such as 'NA'). This streamlines data cleaning and transformation.
1. Problem background and limitations of traditional methods
In data processing, we often need to conditionally retain or invalidate values in other columns of a DataFrame based on the value of a "key" column. For example, when the "key" column is 'key1', we may only care about the values of 'colA' and 'colD', while the other columns should be marked as invalid.
The following is a typical scenario implemented the traditional way with `numpy.select`:
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D']
}
df = pd.DataFrame(data)

# Traditional method: apply np.select to each column individually
df['colA'] = np.select([df['key'] == 'key1'], [df['colA']], default='NA')
df['colD'] = np.select([df['key'] == 'key1'], [df['colD']], default='NA')
df['colB'] = np.select([df['key'] == 'key2'], [df['colB']], default='NA')
df['colC'] = np.select([df['key'] == 'key3'], [df['colC']], default='NA')

print("Result of using np.select:")
print(df)
Output result:
Result of using np.select:
key colA colB colC colD
0 key1 value1A NA NA value1D
1 key2 NA value2B NA NA
2 key3 NA NA value3C NA
3 key1 value4A NA NA value4D
4 key2 NA value5B NA NA
Although this method achieves the goal, it has clear limitations:
- Repetitive code: the np.select logic must be rewritten for every column to be mapped.
- Poor scalability: as the number of columns grows, the code becomes verbose and hard to maintain.
- Efficiency: although np.select is vectorized, multiple independent column operations are still slower than processing all relevant columns at once.
To solve these problems, we need a more efficient and general vectorized approach.
2. Vectorized mapping based on a Boolean mask
Pandas provides powerful tools for constructing and applying Boolean masks to modify a DataFrame conditionally and efficiently. The core idea is to create a Boolean matrix with the same shape as the relevant part of the original DataFrame, where True means the original value should be kept and False means it should be replaced with a default value (such as 'NA').
2.1 Core ideas
- Define mapping rules: Use a dictionary to explicitly specify which target columns are valid for each "key" value.
- Generate Boolean mask: Convert the mapping rule into a Boolean DataFrame, where the rows represent the "keys", the columns represent the data columns, and True means that the column is valid under the key.
- Align and apply mask: Align the resulting Boolean mask to the "key" column of the original DataFrame and then apply it to all target columns at once using the DataFrame.where() method.
2.2 Implementation steps and code examples
First, define the mapping rules, i.e., which columns are valid for each key:
import pandas as pd
import numpy as np

# Recreate the original DataFrame
data = {
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D']
}
df = pd.DataFrame(data)
# 1. Define the mapping between keys and their valid target columns
#    For example: 'key1' corresponds to 'colA' and 'colD'
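The article breaks off at this point, so the remaining steps it outlines in section 2.1 (build the rule dictionary, derive a Boolean mask, align it on the "key" column, and apply `DataFrame.where()`) are sketched below. The names `key_to_cols`, `value_cols`, `mask_by_key`, and `row_mask` are illustrative choices, not taken from the source:

```python
import pandas as pd

# Recreate the sample DataFrame from the article
data = {
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D'],
}
df = pd.DataFrame(data)

# 1. Mapping rules: which columns are valid for each key
key_to_cols = {
    'key1': ['colA', 'colD'],
    'key2': ['colB'],
    'key3': ['colC'],
}
value_cols = ['colA', 'colB', 'colC', 'colD']

# 2. Build a Boolean mask table: one row per key, one column per value column,
#    True where that column is valid for that key
mask_by_key = pd.DataFrame(
    {k: {c: (c in cols) for c in value_cols} for k, cols in key_to_cols.items()}
).T

# 3. Align the mask with each row's key, then apply it to all
#    target columns in a single vectorized call
row_mask = mask_by_key.reindex(df['key'])
row_mask.index = df.index
df[value_cols] = df[value_cols].where(row_mask, 'NA')

print(df)
```

Because `DataFrame.where()` operates on all target columns at once, adding a new rule is just another entry in `key_to_cols`; no per-column `np.select` call is needed, and the output matches the table shown earlier.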