When a dataset is too large to fit in memory, an out-of-core workflow is essential. This article covers a best-practice pattern for handling large data with pandas.
To efficiently manage large datasets, consider the following best-practice workflow:
1. Loading flat files into an on-disk database structure: read each source file in manageable chunks and append them to tables in a pandas HDFStore (an HDF5 file), so the full dataset never has to be held in memory at once.
2. Querying the database to retrieve data into a pandas data structure: select only the groups, columns, or rows you need, keeping each working piece small enough to process in memory.
3. Updating the database after manipulating pieces in pandas: write the transformed pieces back to the store so that later queries reflect the changes (a sketch of this step follows the example below).
Example:
import pandas as pd

# Open (or create) the on-disk HDF5 store that will hold the tables
store = pd.HDFStore("data_store.h5")

# Group mappings for logical field grouping:
# "fields" lists the columns stored in each table,
# "dc" lists the data columns that can be queried on disk
group_map = {
    "A": {"fields": ["field_1", "field_2"], "dc": ["field_1"]},
    "B": {"fields": ["field_10"], "dc": ["field_10"]},
    # ... more groups as needed
}

# Iterate over the flat files and append each chunk to the matching tables
# ("files" is assumed to be the list of flat-file paths to ingest)
for file in files:
    for chunk in pd.read_table(file, chunksize=50000):
        for group, info in group_map.items():
            frame = chunk.reindex(columns=info["fields"], copy=False)
            store.append(group, frame, data_columns=info["dc"])

# Retrieve specific columns across several tables in one query
selected_columns = ["field_1", "field_10"]
group_1 = "A"
group_2 = "B"
data = store.select_as_multiple([group_1, group_2], columns=selected_columns)
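Step 3, updating the store, deserves its own sketch: rows in an HDF5 table cannot be edited in place, so the usual pattern is to select the piece you need, transform it in memory, remove the old rows, and append the result. The snippet below is a minimal illustration only; the file name, the where-clause, and the transformation (doubling field_2) are assumptions for illustration, not part of the original example.

import pandas as pd

# Minimal sketch of step 3: update the store after manipulating a piece in pandas.
# The store name and group "A" follow the example above; the where-clause and
# the transformation are illustrative assumptions.
with pd.HDFStore("data_store.h5") as store:
    # Pull only the rows of interest into memory (field_1 is a data column)
    piece = store.select("A", where="field_1 > 0")

    # Manipulate the piece in pandas
    piece["field_2"] = piece["field_2"] * 2

    # HDF5 tables are not updated in place: remove the old rows, then append the new ones
    store.remove("A", where="field_1 > 0")
    store.append("A", piece)

Note that select_as_multiple expects all of the queried tables to contain exactly the same rows, so if several groups are kept row-aligned, the same remove-and-append should be applied to each of them.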