How to update nested columns

Learn how to update nested columns in Databricks.

Last published at: May 31, 2022

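Spark does not support adding new columns to, or dropping existing columns from, nested structures. In particular, the withColumn and drop methods of the Dataset class do not let you specify a column name that differs from any top-level column. For example, suppose you have a dataset with the following schema: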

%scala
import org.apache.spark.sql.types.StructType

// Two top-level struct columns: metadata and items.
val schema = (new StructType)
  .add("metadata", (new StructType)
    .add("eventid", "string", true)
    .add("hostname", "string", true)
    .add("timestamp", "string", true),
    true)
  .add("items", (new StructType)
    .add("books", (new StructType).add("fees", "double", true), true)
    .add("paper", (new StructType).add("pages", "int", true), true),
    true)

The schema looks like this:

root
 |-- metadata: struct (nullable = true)
 |    |-- eventid: string (nullable = true)
 |    |-- hostname: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |-- items: struct (nullable = true)
 |    |-- books: struct (nullable = true)
 |    |    |-- fees: double (nullable = true)
 |    |-- paper: struct (nullable = true)
 |    |    |-- pages: integer (nullable = true)

Suppose you have the following DataFrame:

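%scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// One row: a metadata struct, and an items struct holding books(fees) and paper(pages).
val rdd: RDD[Row] = sc.parallelize(Seq(Row(
  Row("eventid1", "hostname1", "timestamp1"),
  Row(Row(100.0), Row(10)))))
val df = spark.createDataFrame(rdd, schema)
display(df)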
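The limitation described at the start of this article can be seen with a small experiment. The following is a minimal sketch, assuming a Databricks notebook or spark-shell where `spark` is already defined. The names `demo` and `attempted` are hypothetical and exist only for illustration. Passing a dotted path to withColumn does not reach into the struct. The whole string is used as the name of a new top-level column.

```scala
import org.apache.spark.sql.functions.col

// Assumption: `spark` is the session predefined by the notebook or shell.
// `demo` is a hypothetical one-row DataFrame that mirrors the items.books.fees path.
val demo = spark.sql(
  "SELECT named_struct('books', named_struct('fees', CAST(100.0 AS DOUBLE))) AS items")

// The dotted string is taken as a literal top-level column name,
// so the nested fees field inside `items` is left unchanged.
val attempted = demo.withColumn("items.books.fees", col("items.books.fees") * 1.01)

attempted.printSchema()
```

The printed schema should list a second top-level column next to `items`, which is why the nested struct has to be rebuilt instead.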

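You want to increase the fees column, which is nested under books, by 1%. To update the fees column, you can reconstruct the dataset from the existing columns and the updated column as follows: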

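%scala
// Rebuild the nested structs, replacing only items.books.fees with its value raised by 1%.
val updated = df.selectExpr("""
    named_struct(
        'metadata', metadata,
        'items', named_struct(
          'books', named_struct('fees', items.books.fees * 1.01),
          'paper', items.paper
        )
    ) as named_struct
""").select($"named_struct.metadata", $"named_struct.items")
updated.show(false)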

Then you will get the following result:

+-----------------------------------+-----------------+
| metadata                          | items           |
+===================================+=================+
| [eventid1, hostname1, timestamp1] | [[101.0], [10]] |
+-----------------------------------+-----------------+
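On runtimes based on Spark 3.1 or later, `Column.withField` may be a shorter way to reach the same result, because it replaces one field of a struct and leaves its siblings untouched. The following is a minimal sketch under that version assumption, which the article itself does not state. It also assumes a predefined `spark` session. The names `nested` and `viaWithField` are hypothetical, and the sample row is rebuilt with SQL so the block does not depend on earlier cells.

```scala
import org.apache.spark.sql.functions.col

// Assumption: Spark 3.1+ (Column.withField) and a predefined `spark` session.
// `nested` is a hypothetical DataFrame with the same shape as the example data.
val nested = spark.sql("""
  SELECT
    named_struct('eventid', 'eventid1',
                 'hostname', 'hostname1',
                 'timestamp', 'timestamp1') AS metadata,
    named_struct(
      'books', named_struct('fees', CAST(100.0 AS DOUBLE)),
      'paper', named_struct('pages', 10)
    ) AS items
""")

// Replace the top-level `items` column with a copy in which only
// books.fees is changed; `paper` and `metadata` are carried over as-is.
val viaWithField = nested.withColumn(
  "items",
  col("items").withField("books.fees", col("items.books.fees") * 1.01)
)

viaWithField.show(false)
```

The same Spark versions also provide `Column.dropFields` for removing a nested field. On older runtimes, the named_struct reconstruction remains the approach to use.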
