Both offer a wide range of tools and advantages that can make us doubt which of the two to choose at some point. It is not about changing all the company's processes so that they start using Polars or a “death” to Pandas (this is not going to happen in the immediate future). It is about knowing other tools that can help us reduce costs and time in the processes, obtaining the same or better results.
When we use cloud services we prioritize certain factors, including their cost. The services I use for this process are AWS Lambda with the Python 3.10 runtime and S3 to store the raw file and the parquet converted file.
The intention is to obtain a CSV file as raw data and process it with pandas and polars with the intention of verifying which of these two libraries offers us better optimization of resources such as memory and the weight of the resulting file.
Pandas
It is a Python library specialized in data manipulation and analysis, it is written in C and its initial release was in 2008.
*Polars *
It is a Python and Rust library specialized in data manipulation and analysis that allows parallel processes and is written mostly in Rust and was released in 2022.
The architecture of the process:
The project is something simple as shown in the architecture: The user deposits a CSV file in work/pandas or work/porlas and automatically starts the s3 trigger to process the file to convert it into parquet and deposit it in processed.
In this small project I used two lambdas with the following configuration:
Memory: 2 GB
Ephemeral memory: 2 GB
Life time: 600 seconds
Requirements
Lambda with pandas: Pandas, Numpy and Pyarrow
Lambda with polars: Polars
The dataset used for the comparison is available on kaggle under the name “Rotten Tomatoes Movie Reviews – 1.44M rows” or can be downloaded from here.
The full repository is available on GitHub and can be cloned here.
Size or Weight
The lambda that Pandas uses requires two more plugins to create a parquet file, in this case it is PyArrow and a specific version of numpy for the version of Pandas that I was using. As a result we obtained a lambda with a weight or size of 74.4 MB, something very close to the limit that AWS allows us for the weight of the lambda.
The lambda with Polars does not require another plugin like PyArrow which simplifies life and reduces the size of the lambda to less than half. As a result, our lambda has a weight or size of 30.6 MB compared to the first, giving us room to install other dependencies that we may need for our transformation process.
Performance
The lambda with Pandas was optimized to use compression after the first version, however, its behavior was also analyzed.
Pandas
It took 18 seconds to process the dataset and used 1894 MB of memory to process the CSV file and generate a Parquet file compared to the other versions, it was the one that used the most time and resources.
Pandas + Compression
Adding a line of code allowed us to improve a little compared to the previous version (Pandas), it took 17 seconds to process the dataset and used 1837 MB, which does not represent a significant improvement in processing and computational time, but in size. of the resulting file.
Polars
It took 12 seconds to process the same dataset and I used only 1462 MB, compared to the previous two it represents a time saving of 44.44% and a lower memory consumption.
Output file size
Pandas
The lambda in which a compression process was not established generated a parquet file of 177.4 MB.
Pandas + Compression
When configuring compression in the lambda I do not generate a 121.1 MB parquet file. One small line or option helped us reduce the file size by 31.74%. Considering that it is not a significant code change, it is a very good option.
Polars
Polars generated a 105.8 MB file that, purchased with the first version of Pandas, represents a saving of 40.36% and 12.63% against the Pandas version with compression.
Conclusion
It is not necessary to change all the internal processes that use Pandas so that they now use Polars, however, it is important to consider that if we are talking about thousands or millions of lambda executions, using Polars will help us not only with the deployment time but will also help us to have a lower cost due to the time-based charging that AWS makes for Serverless services such as Lambda.
Likewise, when we translate that 40.36% into millions of files we are talking about GBs or TBs, something that would have a significant impact within a Datalake or Dataware house or even in a cold file storage.
The reduction with Polars would not only be limited to these two factors, because it would greatly affect the output of data and/or objects from AWS because it is a service that does have a cost.
The above is the detailed content of What is faster and cheaper to convert files in AWS: Polar or Pandas?. For more information, please follow other related articles on the PHP Chinese website!