
Splitting a Vector Column into Separate Columns in PySpark
In PySpark, splitting a column of vector values into one column per dimension is a common task. This article walks through approaches for Spark 3.0.0 and above, and for earlier versions.
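Throughout, assume a DataFrame with a string column "word" and an ML vector column "vector". A minimal sketch for building such a DataFrame (the column names match the snippets below; the values are purely illustrative):
<code class="python">from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Illustrative data: one dense and one sparse 3-dimensional vector
df = spark.createDataFrame(
    [("assert", Vectors.dense([1.0, 2.0, 3.0])),
     ("require", Vectors.sparse(3, {1: 2.0}))],
    ["word", "vector"],
)</code>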
Spark 3.0.0 and Above
Spark 3.0.0 introduced the vector_to_array function, simplifying this process:
<code class="python">from pyspark.ml.functions import vector_to_array
df = df.withColumn("xs", vector_to_array("vector"))</code>You can then select the desired columns:
<code class="python">df.select(["word"] + [col("xs")[i] for i in range(3)])</code>Spark Less Than 3.0.0
Spark Versions Before 3.0.0
Approach 1: Converting to RDD
<code class="python">def extract(row):
return (row.word, ) + tuple(row.vector.toArray().tolist())
df.rdd.map(extract).toDF(["word"]) # Vector values will be named _2, _3, ...</code>Approach 2: Using a UDF
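If you would rather not end up with columns named _2, _3, ..., you can pass explicit names to toDF. A small sketch, assuming three-dimensional vectors; x0, x1, x2 are illustrative names:
<code class="python"># One name for the word column plus one name per vector dimension
split_df = df.rdd.map(extract).toDF(["word", "x0", "x1", "x2"])</code>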
<code class="python">from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(col):
    # Convert an ML vector into a Python list so Spark can treat it as an array column
    def to_array_(v):
        return v.toArray().tolist()
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)

df = df.withColumn("xs", to_array(col("vector")))</code>
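Note that the UDF above assumes every row has a non-null vector; on null values it would raise an error. A hedged sketch of one possible adjustment that returns null instead (to_array_nullsafe is an illustrative name, not part of the approach above):
<code class="python"># Reuses udf, ArrayType, DoubleType imported above
def to_array_nullsafe(col):
    def to_array_(v):
        return v.toArray().tolist() if v is not None else None
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)

# Drop-in replacement for to_array when the vector column may contain nulls
df = df.withColumn("xs", to_array_nullsafe(col("vector")))</code>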
<code class="python">df.select(["word"] + [col("xs")[i] for i in range(3)])</code>By implementing any of these methods, you can effectively split a vector column into individual columns, making it easier to work with and analyze your data.