
How to Split a Vector Column into Columns in PySpark?

Susan Sarandon
Release: 2024-11-01 01:06:01


Splitting Vector Column into Columns using PySpark

You have a PySpark DataFrame with two columns, word and vector, where vector has the VectorUDT type. The goal is to split the vector column into multiple columns, one per dimension of the vector.

Solution:

Spark >= 3.0.0

In Spark versions 3.0.0 and above, you can use the vector_to_array function to achieve this:

<code class="python">from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>

This selects word along with new columns named xs[0], xs[1], and xs[2], each holding one dimension of the original vector (here assumed to have three dimensions).
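For intuition, vector_to_array converts both dense and sparse vectors into plain arrays, filling unset sparse positions with zeros. A minimal pure-Python stand-in of that densification (the function name and signature here are illustrative, not Spark API):

```python
def sparse_to_dense(size, index_value_pairs):
    """Expand a sparse vector's (index, value) pairs into a dense list of floats."""
    out = [0.0] * size
    for i, v in index_value_pairs:
        out[i] = float(v)
    return out

# A size-3 sparse vector with value 2 at index 1, as in Vectors.sparse(3, {1: 2}):
sparse_to_dense(3, [(1, 2)])  # [0.0, 2.0, 0.0]
```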

Spark < 3.0.0

For older Spark versions, you can follow these approaches:

Convert to RDD and Extract

<code class="python">from pyspark.ml.linalg import Vectors

df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3])),
    ("require", Vectors.sparse(3, {1: 2}))
]).toDF(["word", "vector"])

def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist())

df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...</code>
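The per-row flattening that extract performs can be checked without a cluster, using a namedtuple as a hypothetical stand-in for a Spark Row whose vector has already been converted to a plain list:

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row (not part of PySpark):
FakeRow = namedtuple("FakeRow", ["word", "vector"])

def extract(row):
    # Prepend the word, then spread each vector component into its own field
    return (row.word,) + tuple(row.vector)

extract(FakeRow("assert", [1.0, 2.0, 3.0]))  # ('assert', 1.0, 2.0, 3.0)
```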

Create a UDF:

<code class="python">from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    # Important: asNondeterministic requires Spark 2.3 or later
    # It can be safely removed i.e.
    # return udf(to_array_, ArrayType(DoubleType()))(col)
    # but at the cost of decreased performance
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>
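What the inner to_array_ does to each value can also be sketched without Spark. FakeVector below is an illustrative assumption, not Spark API; it mimics the toArray() method that pyspark.ml.linalg vectors expose (real vectors return a NumPy array, so the UDF calls .tolist() on the result):

```python
class FakeVector:
    """Toy stand-in mimicking the toArray() method of a pyspark.ml.linalg vector."""
    def __init__(self, values):
        self._values = values

    def toArray(self):
        # Real vectors return a NumPy array; a plain list behaves
        # equivalently for this sketch.
        return list(self._values)

def to_array_(v):
    # Mirrors the UDF body: convert the vector to a plain Python list
    return list(v.toArray())

to_array_(FakeVector([1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0]
```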

Either approach yields a DataFrame with a separate column for each dimension of the original vector, which makes the data easier to work with downstream.


source:php.cn