Home > Database > Mysql Tutorial > How to select a subset of data in SQL query style in Pandas?

How to select a subset of data in SQL query style in Pandas?

WBOY
Release: 2023-08-31 10:13:10
forward
1175 people have browsed it

如何在 Pandas 的 SQL 查询样式中选择数据子集?

Introduction

In this article, I will show you how to use Pandas to perform data analysis with SQL-style filtering. Most enterprise data is stored in databases that require SQL to retrieve and manipulate. For example, companies like Oracle, IBM, Microsoft have their own databases and their own SQL implementations.

Data scientists must deal with SQL at some point in their career because data is not always stored in CSV files. I personally prefer to use Oracle because most of my company's data is stored in Oracle.

Scenario – 1 Suppose we have a task to find all movies from our dataset that have the following criteria.

  • The language of the movie should be English (en) or Spanish (es).
  • The movie's popularity must be between 500 and 1000.
  • The movie's status must be Published.
  • The number of votes must be greater than 5000. For the above scenario, the SQL statement looks like the following.
SELECT
FROM WHERE
title AS movie_title
,original_language AS movie_language
,popularityAS movie_popularity
,statusAS movie_status
,vote_count AS movie_vote_count movies_data
original_languageIN ('en', 'es')

AND status=('Released')
AND popularitybetween 500 AND 1000
AND vote_count > 5000;
Copy after login

Now that you have seen the SQL statement that meets your needs, let us proceed step by step using pandas. I'm going to show you two ways.

Method 1: Boolean Index

1. Load the movies_data dataset into the DataFrame.

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
Copy after login

Assign a variable to each condition.

languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages )
condition_on_status = movies . status == "Released"
condition_on_popularity = movies . popularity . between ( 500 , 1000 )
condition_on_votecount = movies . vote_count > 5000
Copy after login

3. Combine all conditions (boolean array) together.

final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount )
columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ]
# clubbing all together movies . loc [ final_conditions , columns ]
Copy after login

zhpublished724.24778410867##788Deadpool
Title

original_language

Status

Popularity

vote_count

< p>
##95星

< /td>

zh

Released

514.569956

10995

Method 2:- .query() method.

.query() method is a method for filtering data in SQL where clause style. Conditions can be passed to this method as strings, however, column names must not contain any spaces.

If there are spaces in the column names, use the python replace function to replace them with underscores.

In my experience, I have found that the query() method is faster than previous methods when applied to larger DataFrames. < /p>

import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )
Copy after login

4. Build the query string and execute the method.

Please note that the .query method does not work with triple quoted strings that span multiple lines.

final_conditions = (
"original_language in [&#39;en&#39;,&#39;es&#39;]"
"and status == &#39;Released&#39; "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000"
) final_result = movies . query ( final_conditions )
final_result
Copy after login

## 95##67512001758000000< td>

Deadpool

## There are many more, usually in my coding I have Multiple values ​​to check my "in" clause. So the syntax above is not ideal. Python variables can be referenced using the at symbol (@). You can also programmatically create values ​​as Python lists and use them with (@).

##Budget

< /p>id

original_language

original_title

Popularity

Release Date

Revenue

Runtime

st

165000000

157336

zh

724.247784

May 11, 2014

< /p>

##169.0

Relationship

##788

293660

zh

514.569956

September 2, 2016

783112979

108.0

Relationship

movie_languages = [ &#39;en&#39; , &#39;es&#39; ]
final_conditions = (
"original_language in @movie_languages "
"and status == &#39;Released&#39; "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000" )
final_result = movies . query ( final_conditions )
final_result
Copy after login

##Budget

idoriginal_languageoriginal_titlePopularityRelease DateRevenueRuntimest## 95165000000293660514.569956

157336

zh

724.247784

May 11, 2014

675120017

169.0

Relationship

##788

##58000000

zh

Deadpool

September 2, 2016

783112979

108.0 p>

Relationship

The above is the detailed content of How to select a subset of data in SQL query style in Pandas?. For more information, please follow other related articles on the PHP Chinese website!

source:tutorialspoint.com
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template