ProtoSerializer stack overflow error in DBConnect

A stack overflow error in DBConnect indicates that you need to allocate more memory on the local PC.

Written by ashritha.laxminarayana

Last published at: May 9th, 2022

Problem

You are using DBConnect (AWS|Azure|GCP) to run a PySpark transformation on a DataFrame with more than 100 columns when you get a stack overflow error.

py4j.protocol.Py4JJavaError: An error occurred while calling o945.count.
: java.lang.StackOverflowError
	at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
	at java.lang.Class.getEnclosingClass(Class.java:1272)
	at java.lang.Class.getSimpleBinaryName(Class.java:1443)
	at java.lang.Class.getSimpleName(Class.java:1309)
	at org.apache.spark.sql.types.DataType.typeName(DataType.scala:67)
	at org.apache.spark.sql.types.DataType.simpleString(DataType.scala:82)
	at org.apache.spark.sql.types.DataType.sql(DataType.scala:90)
	at org.apache.spark.sql.util.ProtoSerializer.serializeDataType(ProtoSerializer.scala:3207)
	at org.apache.spark.sql.util.ProtoSerializer.serializeAttrRef(ProtoSerializer.scala:3610)
	at org.apache.spark.sql.util.ProtoSerializer.serializeAttr(ProtoSerializer.scala:3600)
	at org.apache.spark.sql.util.ProtoSerializer.serializeNamedExpr(ProtoSerializer.scala:3537)
	at org.apache.spark.sql.util.ProtoSerializer.serializeExpr(ProtoSerializer.scala:2323)
	at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:3001)
	at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:2998)

Performing the same operation in a notebook works correctly and does not produce an error.

Example code

You can reproduce the error with this sample code.

It creates a DataFrame with 200 columns and renames them all.

This sample code runs correctly in a notebook, but results in an error when run in DBConnect.

%python

df = spark.createDataFrame([{str(i): i for i in range(200)}])
for col in df.columns:
    df = df.withColumnRenamed(col, col + "_a")
df.collect()

Cause

When you run code in DBConnect, some functions are handled on the remote cluster driver, but some are handled locally on the client PC.

If not enough memory is allocated on the local PC, the locally handled functions fail with an error.
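The stack trace shows the ProtoSerializer recursing through the query plan: each withColumnRenamed call wraps the plan in another projection, so the recursion depth grows with the number of columns. A minimal Python analogy of that depth growth (not DBConnect code; all names here are illustrative only):

```python
# Build a nested structure with one level per "renamed column",
# mimicking how each withColumnRenamed adds a projection to the plan.
def build_plan(depth):
    plan = "scan"
    for i in range(depth):
        plan = {"project": plan, "column": f"{i}_a"}
    return plan

# A naive recursive walk, analogous to the serializer's traversal:
# one stack frame per nesting level.
def serialize(node):
    if isinstance(node, dict):
        return f"Project({serialize(node['project'])})"
    return node

deep = serialize(build_plan(200))  # 200 nested Project(...) wrappers
```

With 200 columns the traversal needs 200 nested calls; a fixed JVM thread stack that is too small for that depth overflows, which is why increasing the stack size (-Xss) in the solution below helps.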

Solution

You should increase the memory allocated to the Apache Spark driver on the local PC.

  1. Run databricks-connect get-spark-home on your local PC to get the ${spark_home} value.
  2. Navigate to the ${spark_home}/conf/ folder.
  3. Open the spark-defaults.conf file.
  4. Add the following settings to the spark-defaults.conf file:
    spark.driver.memory 4g
    spark.driver.extraJavaOptions -Xss32M
  5. Save the changes.
  6. Restart DBConnect.
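Steps 2 through 5 can also be scripted; a minimal sketch in Python, assuming ${spark_home} has already been retrieved with databricks-connect get-spark-home (the function name is illustrative, not part of DBConnect):

```python
import os

def add_dbconnect_settings(spark_home):
    """Append the driver memory and stack size settings to spark-defaults.conf."""
    conf_path = os.path.join(spark_home, "conf", "spark-defaults.conf")
    # Append rather than overwrite, so any existing settings are preserved.
    with open(conf_path, "a") as conf:
        conf.write("\nspark.driver.memory 4g\n")
        conf.write("spark.driver.extraJavaOptions -Xss32M\n")
```

Call it with the path printed by databricks-connect get-spark-home, then restart DBConnect so the new settings take effect.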

Warning

DBConnect only works with supported Databricks Runtime versions. Ensure that you are using a supported runtime on your cluster before using DBConnect.


