In Apache Spark, data is usually fetched from an external database by loading an entire table through the DataFrameReader's JDBC source. Sometimes, however, you only need the results of a specific query.
Since Apache Spark 2.0.0, you can pass a parenthesized subquery as the dbtable option when reading from a JDBC source. This lets you fetch just the results of that query instead of the entire table.
Consider the following code snippet written in PySpark:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("spark play") \
    .getOrCreate()

# Replace the port, credentials, and schema/table names with your own values.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:port") \
    .option("dbtable", "(SELECT foo, bar FROM schema.tablename) AS tmp") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
In this example, instead of naming the schema.tablename table directly, the dbtable option is set to the derived table (SELECT foo, bar FROM schema.tablename) AS tmp. Spark embeds this value in the FROM clause of the SQL it sends over JDBC, so the subquery is executed on the database side and only the foo and bar columns are transferred. Note that tmp is simply the alias SQL requires for a derived table, not a temporary table that gets created on the server. The DataFrameReader then loads the result set into the DataFrame df.
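Why does this work? The reader splices the dbtable value verbatim into the FROM clause of the SQL it generates, so any parenthesized, aliased subquery is valid wherever a plain table name is. The following plain-Python sketch (no Spark required; build_select is a hypothetical helper for illustration, not part of Spark's API) shows the idea:

```python
# Illustrative sketch: how a JDBC reader might compose a SELECT around
# the dbtable value. build_select is an invented helper, not Spark code.

def build_select(dbtable, columns="*", where=None):
    """Splice the dbtable value into a SELECT's FROM clause."""
    sql = "SELECT {} FROM {}".format(columns, dbtable)
    if where:
        sql += " WHERE {}".format(where)
    return sql

# A plain table name and a parenthesized subquery are interchangeable:
plain = build_select("schema.tablename")
sub = build_select("(SELECT foo, bar FROM schema.tablename) AS tmp", where="1=0")

print(plain)  # SELECT * FROM schema.tablename
print(sub)    # SELECT * FROM (SELECT foo, bar FROM schema.tablename) AS tmp WHERE 1=0
```

Because the database parses the composed statement as ordinary SQL, the alias (AS tmp) is mandatory: most databases reject a derived table in a FROM clause that has no alias.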
The above is the detailed content of "How to Fetch Specific Query Results from External Databases in Apache Spark 2.0.0?".