How to select a subset of data in SQL query style in Pandas?-Mysql Tutorial-php.cn

How to select a subset of data in SQL query style in Pandas?

WBOY

Release： 2023-08-31 10:13:10

forward

1175 people have browsed it

如何在 Pandas 的 SQL 查询样式中选择数据子集？

Introduction

In this article, I will show you how to use Pandas to perform data analysis with SQL-style filtering. Most enterprise data is stored in databases that require SQL to retrieve and manipulate. For example, companies like Oracle, IBM, Microsoft have their own databases and their own SQL implementations.

Data scientists must deal with SQL at some point in their career because data is not always stored in CSV files. I personally prefer to use Oracle because most of my company's data is stored in Oracle.

Scenario – 1 Suppose we have a task to find all movies from our dataset that have the following criteria.

The language of the movie should be English (en) or Spanish (es).
The movie's popularity must be between 500 and 1000.
The movie's status must be Published.
The number of votes must be greater than 5000. For the above scenario, the SQL statement looks like the following.

SELECT
FROM WHERE
title AS movie_title
,original_language AS movie_language
,popularityAS movie_popularity
,statusAS movie_status
,vote_count AS movie_vote_count movies_data
original_languageIN (&#39;en&#39;, &#39;es&#39;)

AND status=(&#39;Released&#39;)
AND popularitybetween 500 AND 1000
AND vote_count > 5000;

Copy after login

Now that you have seen the SQL statement that meets your needs, let us proceed step by step using pandas. I'm going to show you two ways.

Method 1: Boolean Index

1. Load the movies_data dataset into the DataFrame.

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

Copy after login

Assign a variable to each condition.

languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages )
condition_on_status = movies . status == "Released"
condition_on_popularity = movies . popularity . between ( 500 , 1000 )
condition_on_votecount = movies . vote_count > 5000

Copy after login

3. Combine all conditions (boolean array) together.

final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount )
columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ]
# clubbing all together movies . loc [ final_conditions , columns ]

Copy after login

zhpublished724.24778410867##788Deadpool

Title	original_language	Status	Popularity	vote_count < p>
##95星				< /td>
zh	Released	514.569956	10995

Title

original_language

Status

Popularity

vote_count

< p>

##95星

< /td>

Released

514.569956

10995

Method 2:- .query() method.

.query() method is a method for filtering data in SQL where clause style. Conditions can be passed to this method as strings, however, column names must not contain any spaces.

If there are spaces in the column names, use the python replace function to replace them with underscores.

In my experience, I have found that the query() method is faster than previous methods when applied to larger DataFrames. < /p>

import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )

Copy after login

4. Build the query string and execute the method.

Please note that the .query method does not work with triple quoted strings that span multiple lines.

final_conditions = (
"original_language in [&#39;en&#39;,&#39;es&#39;]"
"and status == &#39;Released&#39; "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000"
) final_result = movies . query ( final_conditions )
final_result

Copy after login

## 95##67512001758000000< td>

Deadpool

## There are many more, usually in my coding I have Multiple values to check my "in" clause. So the syntax above is not ideal. Python variables can be referenced using the at symbol (@). You can also programmatically create values as Python lists and use them with (@).

##Budget	< /p>id	original_language	original_title	Popularity	Release Date	Revenue	Runtime	st
165000000	157336	zh	星	724.247784	May 11, 2014	< /p> ##169.0	Relationship		##788
293660	zh	514.569956	September 2, 2016	783112979	108.0	Relationship

##Budget

< /p>id

original_language

original_title

Popularity

Release Date

Revenue

Runtime

165000000

157336

星

724.247784

May 11, 2014

< /p>

##169.0

Relationship

##788

293660

514.569956

September 2, 2016

783112979

108.0

Relationship

movie_languages = [ &#39;en&#39; , &#39;es&#39; ]
final_conditions = (
"original_language in @movie_languages "
"and status == &#39;Released&#39; "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000" )
final_result = movies . query ( final_conditions )
final_result

Copy after login

##Budget

idoriginal_languageoriginal_titlePopularityRelease DateRevenueRuntimest## 95165000000293660514.569956


157336 zh	星	724.247784	May 11, 2014	675120017	169.0	Relationship		##788	##58000000
zh	Deadpool	September 2, 2016	783112979	108.0 p>	Relationship

The above is the detailed content of How to select a subset of data in SQL query style in Pandas?. For more information, please follow other related articles on the PHP Chinese website!