Solved: Reading multiple CSV files using pathos.multiproce… - Databricks - 13582

Prototype998 · ‎07-13-2022

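I am using PySpark and pathos to read a large number of CSV files and create many DataFrames, but I keep running into this problem.

Code for the same: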

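from pathos.multiprocessing import ProcessingPool

def readCsv(path):
    return spark.read.csv(path, header=True)

# file[0] is the full path; [5:] drops the leading "dbfs:" prefix
csv_file_list = [file[0][5:] for file in dbutils.fs.ls("/databricks-datasets/COVID/coronavirusdataset/") if file[1].endswith(".csv")]

pool = ProcessingPool(2)

results = pool.map(readCsv, csv_file_list)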

Rishabh264 · ‎12-22-2022

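Hey @Punit Chauhan, refer to this code:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)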
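The same thread-based idea can be applied directly to the CSV-reading case from the question. Below is a minimal sketch, not a tested Databricks solution: `read_all` and `read_one` are hypothetical names introduced here, and it assumes the original failure comes from the process pool trying to ship the driver-side `spark` session to worker processes. Threads run inside the driver process, so they can share the existing session.

```python
from multiprocessing.pool import ThreadPool


def read_all(paths, read_one, workers=2):
    """Call read_one(path) for every path on a thread pool.

    Returns a dict mapping each path to its result. read_one is any
    callable; on Databricks it would wrap spark.read.csv (assumption).
    """
    with ThreadPool(workers) as pool:
        # map() blocks until every call has finished and keeps input order
        results = pool.map(read_one, paths)
    return dict(zip(paths, results))


# Hypothetical usage in a notebook where `spark` and `csv_file_list` exist:
# frames = read_all(csv_file_list, lambda p: spark.read.csv(p, header=True))
```

Keeping the reader as a parameter means the helper can be tried locally with any function before pointing it at Spark.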

AmanSehgal · ‎07-14-2022

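You don't really need to filter for .csv files like that.

You can use the "pathGlobFilter" option to select only the files whose names match a given pattern. Note that it takes a glob pattern, not a regular expression.

df = spark.read.option("pathGlobFilter", "*.csv").csv(upload_path)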
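To make the option chain above a bit more concrete, here is a small sketch that wraps it in a function. `read_csv_dir` is a hypothetical helper name, the `header` option is carried over from the question's code, and the session is passed in as a parameter only so the function does not depend on a global `spark`.

```python
def read_csv_dir(spark, directory):
    """Read every *.csv file directly under `directory` into one DataFrame.

    `spark` is expected to be a SparkSession (assumption: Spark 3.0+,
    where the pathGlobFilter option is available).
    """
    return (
        spark.read
        .option("pathGlobFilter", "*.csv")  # glob on file names, not a regex
        .option("header", True)             # first row holds column names
        .csv(directory)
    )


# Hypothetical usage in a Databricks notebook:
# df = read_csv_dir(spark, "/databricks-datasets/COVID/coronavirusdataset/")
```

This yields a single DataFrame covering all matched files, so it fits best when the files share a schema; if one DataFrame per file is needed, the files still have to be read separately.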

Vidula · ‎09-04-2022

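Hi @Punit Chauhan

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!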

Prototype998 · ‎12-22-2022

@Ajay Pandey @Rishabh Pandey

