Problem
You are using DBConnect (AWS | Azure | GCP) to run a PySpark transformation on a DataFrame with more than 100 columns, and you get a stack overflow error.
py4j.protocol.Py4JJavaError: An error occurred while calling o945.count.
: java.lang.StackOverflowError
	at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
	at java.lang.Class.getEnclosingClass(Class.java:1272)
	at java.lang.Class.getSimpleBinaryName(Class.java:1443)
	at java.lang.Class.getSimpleName(Class.java:1309)
	at org.apache.spark.sql.types.DataType.typeName(DataType.scala:67)
	at org.apache.spark.sql.types.DataType.simpleString(DataType.scala:82)
	at org.apache.spark.sql.types.DataType.sql(DataType.scala:90)
	at org.apache.spark.sql.util.ProtoSerializer.serializeDataType(ProtoSerializer.scala:3207)
	at org.apache.spark.sql.util.ProtoSerializer.serializeAttrRef(ProtoSerializer.scala:3610)
	at org.apache.spark.sql.util.ProtoSerializer.serializeAttr(ProtoSerializer.scala:3600)
	at org.apache.spark.sql.util.ProtoSerializer.serializeNamedExpr(ProtoSerializer.scala:3537)
	at org.apache.spark.sql.util.ProtoSerializer.serializeExpr(ProtoSerializer.scala:2323)
	at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:3001)
	at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:2998)
Performing the same operation in a notebook works correctly and does not produce an error.
Example code
You can reproduce the error with this sample code.
It creates a DataFrame with 200 columns and renames them all.
This sample code runs correctly in a notebook, but results in an error when run in DBConnect.
%python
df = spark.createDataFrame([{str(i): i for i in range(200)}])
for col in df.columns:
    df = df.withColumnRenamed(col, col + "_a")
df.collect()
Cause
When you run code in DBConnect, some functions are executed on the remote cluster's driver, while others are handled locally, on the client PC. One of the locally handled steps is serializing the query plan, and each chained transformation (such as the repeated withColumnRenamed calls in the example) makes that plan more deeply nested. If not enough memory is allocated to the local Spark driver, this error occurs.
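The nesting effect can be sketched in plain Python. This is an illustrative model only, not Spark's actual serializer: the `Project` class and `serialize` function below are hypothetical stand-ins for the logical plan nodes and the recursive client-side ProtoSerializer seen in the stack trace.

```python
# Illustrative sketch (not Spark itself): each rename wraps the existing
# logical plan in another node, so 200 renames yield a plan nested 200 deep.
class Project:
    def __init__(self, child=None):
        self.child = child  # the plan produced by the previous transformation

def serialize(plan):
    # Recursive descent, like the ProtoSerializer frames in the stack trace:
    # one stack frame per nested plan node.
    if plan is None:
        return "Scan"
    return "Project(" + serialize(plan.child) + ")"

# 200 nested nodes (as in the example code) serialize without trouble.
plan = None
for _ in range(200):
    plan = Project(plan)
print(serialize(plan)[:30], "...")

# A much deeper plan exceeds the available stack and the recursion fails,
# which is the Python analogue of the JVM's StackOverflowError.
deep_plan = None
for _ in range(5000):
    deep_plan = Project(deep_plan)
try:
    serialize(deep_plan)
except RecursionError:
    print("RecursionError: plan too deep for the client stack")
```

The depth of the recursion, not the size of the data, is what exhausts the stack, which is why raising the driver's stack and memory limits (below) resolves the error.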
Solution
You should increase the memory allocated to the Apache Spark driver on the local PC.
- Run databricks-connect get-spark-home on your local PC to get the ${spark_home} value.
- Navigate to the ${spark_home}/conf/ folder.
- Open the spark-defaults.conf file.
- Add the following settings to the spark-defaults.conf file:
spark.driver.memory 4g
spark.driver.extraJavaOptions -Xss32M
- Save the changes.
- Restart DBConnect.
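As an aside not covered by the steps above (an assumption on our part, not part of the original fix), the plan depth itself can be reduced by renaming every column in a single operation instead of chaining 200 withColumnRenamed calls. A minimal sketch, assuming df is the 200-column DataFrame from the example:

```python
# Workaround sketch (assumption): compute all new names up front and rename
# in one pass, so the logical plan gains a single node instead of one per
# column. Column names here match the example code above.
old_names = [str(i) for i in range(200)]
new_names = [name + "_a" for name in old_names]

# With a live SparkSession and the example DataFrame, the one-pass rename is:
#   df = df.toDF(*new_names)
# which produces the same columns as the 200 chained withColumnRenamed calls.
print(new_names[:3])
```

This keeps the query plan shallow on the client, so the serializer never recurses deeply enough to overflow the stack, regardless of the driver settings.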