Getting Started with Delta Lake

Making Apache Spark Better with Delta Lake

Michael Armbrust, Principal Software Engineer, Databricks

Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and in particular defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.

Series details

This session is part of the Getting Started with Delta Lake series with Denny Lee and the Delta Lake team.

Session abstract

Join Michael Armbrust, head of the Delta Lake engineering team, to learn how his team built upon Apache Spark to bring ACID transactions and other data reliability technologies from the data warehouse world to cloud data lakes.

Apache Spark is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to quality, reliable data. This webinar covers how to use Delta Lake within a Spark environment to improve data reliability.

Topic areas include:

  • The role of Apache Spark in big data processing
  • Use of data lakes as an important part of the data architecture
  • Data lake reliability challenges
  • How Delta Lake helps provide reliable data for Spark processing
  • Specific improvements that Delta Lake adds
  • The ease of adopting Delta Lake for powering your data lake

What you need:
Sign up for Community Edition here and get access to the workshop presentation materials and sample notebooks.

Video transcript

- (Denny) Hi, everybody. Welcome to today's webinar, Making Apache Spark Better with Delta Lake.

Before we begin today's presentation, we want to go over a few housekeeping items to make sure you have the best experience. Please note that your audio connection will be muted for the webinar, for everyone's comfort. If you have any questions or concerns, please raise them in the Q&A panel or the chat. We encourage you to use this time to ask as many questions as you like and clarify any doubts you may have about today's topic. Our main presenter today is Michael Armbrust, the original creator of Spark SQL and Structured Streaming, and one of the main creators of Delta Lake. He is a principal engineer at Databricks, so without any further delay, Michael, take it away. - (Michael) Thank you, Denny. I'm super excited today to talk about how to make Apache Spark better using Delta Lake. Before I jump in, though, I want to start by discussing this concept of a data lake: why so many people are excited about it, and why there are so many challenges when they try to set these things up.

The promise of the data lake

First of all, what is a data lake, and what do I mean by that? The promise of the data lake is basically this: organizations have a lot of data. It might be carefully curated customer data in an OLTP system. It might be raw clickstream coming from your web servers, or it might be unstructured data coming from a bunch of sensors. And the promise of the data lake is that you can take all of that data and just dump it into the lake. That's really powerful when you compare it with a traditional database, because in a traditional database you have to start by coming up with a schema and doing cleaning. That's usually called schema-on-write, and what the data lake lets you do is skip that process and just collect everything first, because sometimes you don't know why data is valuable until much later, and if you haven't stored it, then you've already lost it. So with the data lake, it's just a bunch of files in a file system. It could be S3 or HDFS or Azure Blob Storage, and you can dump everything there, then come back and look at it later. And the idea is that once you've collected it all, then you can get insights from it. You can do data science and machine learning. You can build powerful tools for your business, like recommendation engines or fraud detection algorithms. You can even do crazy things like cure cancer using genomics and DNA sequencing. However, I’ve seen this story many, many times, and typically what happens is unfortunately the data at the beginning is garbage. And so the data that you store in your data lake is garbage, and as a result, you get garbage out from these kind of more advanced processes that you try to do at the end. And why does that happen? Why is it so difficult to get quality and reliability out of these data lakes? And what does this kinda typical project look like? So I wanna walk you through a story that I’ve seen kind of happen over and over again at many organizations, when they sit down and try to extract insights from their data.

The evolution of a cutting-edge data lake

It usually goes something like this, and this is what is considered cutting-edge today, but before Delta Lake. So a pretty common pattern is you've got a stream of events. They're coming into some system like Apache Kafka, and your mission is to do two things. You need to do streaming analytics, so you know in real time what's going on in your business. You also want to do AI and reporting, where you can look over a longer period of time, do longitudinal analysis, actually look at the history and trends, and make predictions about the future. So how are we going to do this? Step one, I sit down at my computer, and I know Spark has nice APIs for reading from Apache Kafka. You can use DataFrames, Datasets, and SQL and process with Spark, do aggregations and time windows and all kinds of things, and come up with your streaming analytics. So we get started, and right off the bat it works pretty well, but that brings us to a challenge, which is historical queries.

Challenge #1: Historical queries?

Kafka is great for getting real-time analytics, but it can only store a day or a week of data. You don't want to be storing years and years of data in Kafka. So we have to solve that. Real time is great for what's happening right now, but it's not good for looking at historical trends. So I've been reading a lot of blogs, and a pretty common pattern that shows up here is this thing called the lambda architecture, which as far as I can tell is basically: you just do everything twice. You have one real-time thing that does an approximation and gives you what's happening at this exact moment, and you have another pipeline that is maybe more curated. It runs a little more slowly, but it archives all of that data into your data lake. So that's step one. If we want to solve this historical query problem, we're also going to set up the lambda architecture on top of vanilla Apache Spark, and once I have all the data in the data lake, the idea is I can now run Spark SQL queries over it, and now you can do AI and reporting too. A little extra work, some extra coordination, but fortunately Spark has a unified API for batch and streaming. And so it’s possible to do it and we get it set up, but that brings us to challenge number two.

Challenge #2: Messy data?

Like I said before, real-world data is often messy. Some team upstream from you changes the schema without telling you, and now you have problems. So a pattern that I see is you need to add validations. You need to write extra Spark SQL programs that check that your assumptions about the data are correct. If they're wrong, it sends off an email so you can correct it. And of course, because we did the lambda architecture, we have to do the validations in two different places, but, again, that's something we can do. We can use Spark for it. So now we've set up validations to handle the messy data. Unfortunately, that brings us to challenge number three, which is mistakes and failures. Those validations are great, but sometimes you forget to put one in place, or there's a bug in the code, or even harder, the code just crashes in the middle because you're running on spot instances on EC2 and they die, and so on, and now you need to worry about, "How do I clean that up?" The real problem with these kinds of distributed systems and distributed file systems is that if a job crashes in the middle, it leaves partial garbage results behind that need to be cleaned up. And so you’re kind of forced to do all of the reasoning about correctness yourself. The system isn’t giving you a lot of help here. And so a pretty common pattern is people, rather than working on an entire table at a time, because if something goes wrong, we need to recompute the entire table. They’ll instead break it up into partitions. So you have a different folder. Each folder stores a day or an hour or a week, whatever kind of granularity makes sense for your use case. And you can build a lot of, kinda scripting around it, so that it’s easy for me to do recomputation. So if one of those partitions gets corrupted for any reason, whether it was a mistake in my code or just a job failure, I can just delete that entire directory and reprocess that data for that one partition from scratch. And so kind of by building this partitioning and reprocessing engine, now I can handle these mistakes and failures. There was a little bit of extra code to write, but now I can kind of sleep safe and sound knowing that this is gonna work.

Challenge #4: Updates?

And that brings us to challenge number four, updates. It's really hard to do point updates. It's very easy to add data, but it's very hard to change data in this lake and to do it correctly. You might need to do this because of GDPR. You might need to do retention. You might need to do anonymization or other things, or you might just have bad data. So now you end up having to write a whole other class of Spark jobs that do updates and merges. This can be very difficult, and typically, because it's so difficult, what I see people do is, rather than doing individual updates, which would be very cheap, whenever they get a batch of requests, maybe once a month, they just copy the entire table, removing the people who have asked to be forgotten under GDPR. They can do that, but it's yet another Spark job to run. It's very expensive. And there’s kind of a subtlety here that makes it extra difficult, which is if you modify a table while somebody is reading it, generating a report, they’re gonna see inconsistent results and that report will be wrong. So you’ll need to be very careful to schedule this to avoid any conflicts, when you’re performing those modifications. But these are all problems that people can solve. You do this at night and you run your reports during the day or something. And so now we’ve got a mechanism for doing updates. However, the problem here is this has become really complicated. And what that means is you’re wasting a lot of time and money solving systems problems rather than doing what you really want to be doing, which is extracting value from your data. And the way I look at this is these are all distractions of the data lake that prevent you from actually accomplishing your job at hand.

Data lake distractions

And to summarize what I think these are: a big one is atomicity. When you run a distributed computation, if the job fails in the middle, you're still left with some partial results. It's not all or nothing. Atomicity means that when a job runs, it either completely finishes correctly, or, if anything goes wrong, it completely rolls back and nothing happens. So you're no longer left with your data in a corrupt state, requiring you to tediously build these tools to do manual recovery. Another key problem is that there's no quality enforcement. It's up to you, in every job, to manually check the incoming data against all of your assumptions. There's no help from the system, like invariants in a traditional database, where you can say, "No, this column is required," or "This must be this type of schema." All of that is left to you as the programmer to handle. And finally, there's no control over consistency or isolation. That means you can really only do one correct operation against any data lake table at a time, and it makes it very hard to mix streaming and batch operations while people are reading from it. And these are all things that you kind of, you would expect from your data storage system. You would want to be able to do these things, and people should always be able to see a consistent snapshot automatically.

So let's take a step back and look at what this process looks like with Delta Lake.

Challenges of the data lake

And the idea with Delta Lake is that we take this relatively complicated architecture, where a lot of the correctness and other concerns were left to you manually writing Spark programs, and we change it so that you think only in terms of data flows: you take all of the data from your organization and flow it through, continually improving its quality until it's ready for consumption.

The Delta Lake architecture

這裏的這個體係結構的特點,首先,三角洲湖給Apache火花帶來完整的ACID事務。運行的,這意味著每一個火花工作將完成整個工作或什麼都沒有。的人閱讀和寫作的同時保證一致的快照。當寫出的東西,它絕對是寫出它不會丟失。這些都是酸的特點。這允許你關注實際的數據流,而不是思考所有這些額外的係統的問題和解決的事一遍又一遍。三角洲湖的另一個關鍵方麵是它是基於開放標準和開放源碼。這是一個完整的Apache許可,沒有愚蠢的常見條款或類似的東西。你可以把它和使用它為任何你想要的應用程序完全免費的。,就我個人而言,這將是非常重要的,如果我是存儲海量數據,對吧? Data has a lot of gravity. There’s a lot of inertia when you collect a lot of data and I wouldn’t want to put it in some black box where it’s very difficult for me to extract it. And this means that you can store that mass amount of data without worrying about lock-in. So both is it open source, but it’s also based on open standards. So I’ll talk about this in more detail later in the talk, but underneath the covers, Delta is actually storing your data in parquet. So you can read it with other engines and there’s kind of a growing community around Delta Lake building this native support in there. But worst case scenario, if you decide you want to leave from Delta Lake all you need to do is delete the transaction log and it just becomes a normal parquet table. And then finally, Delta Lake is deeply powered by Apache Spark. And so what this means is if you’ve got existing Spark jobs, whether they’re streaming or batch, you can easily convert those to getting all kinds of benefits of Delta without having to rewrite those programs from scratch. And I’m gonna talk exactly about what that looks like later in the talk. But now I want to take this picture and simplify it a little to talk about some of the other hallmarks I see of the Delta Lake architecture, and where I’ve seen people be very successful. So first of all, I wanna kind of zone in on this idea of data quality levels. These are not fundamental things of Delta Lake. I think these are things that people use a variety of systems, but I’ve seen people very successful with this pattern, alongside the features of Delta.

Delta Lake

So these are just general classes of data quality, and the idea here is that as you bring data into the lake, rather than trying to make it perfect all at once, you incrementally improve the quality of the data until it's ready for consumption. And I'll talk about why I think that's actually a very powerful pattern that can help you be more productive. So you start at the beginning with your bronze-level data. This is a dumping ground for raw data. It's still on fire, and I actually think that's a good thing, because the core idea here is that if you capture everything without doing a lot of munging or parsing on it, there's no way you can have bugs in your parsing and munging code. You keep everything from the start, and you can often keep a year's worth of retention. I'll say in a bit why I think that's really important, but it means you can collect everything. You don't have to spend a lot of time up front deciding which data is going to be valuable and which isn't. You can figure that out later, when you do your analysis. Moving on from bronze, we continue to the silver level of data. This is data that is not yet ready for consumption. It’s not a report that you’re gonna give to your CEO, but I’ve already done some cleanup. I filtered out one particular event type. I’ve parsed some JSON and given it a better schema or maybe I’ve joined and augmented different data sets. I kinda got all the information I want in one place. And you might ask, if this data isn’t ready for consumption, why am I creating a table, taking the time to materialize it? And there’s actually a couple of different reasons for that. One is oftentimes these intermediate results are useful to multiple people in your organizations. And so by creating these silver level tables where you’ve taken your domain knowledge and cleaned the data up, you’re allowing them to benefit from that kind of automatically without having to do that work themselves. But a more interesting and kind of more subtle point here is it also can really help with debugging. When there’s a bug in my final report, being able to query those intermediate results is very powerful ’cause I can actually see what data produced those bad results and see where in the pipeline it made sense. And this is a good reason to have multiple hops in your pipeline. And then finally, we move on to kind of the gold class of data. This is clean data. It’s ready for consumption at business-level aggregates, and actually talks about how things are running and how things are working, and this is almost ready for a report. And here you start using a variety of different engines. So like I said, Delta Lake already works very well with Spark, and there’s also a lot of interest in adding support for Presto and others, and so you can do your kind of streaming analytics and AI and reporting on it as well.
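A minimal sketch of one such hop, continuously cleaning a hypothetical bronze table into a silver table with Structured Streaming (the paths and column names below are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the raw bronze table as a stream, keep one event type, and
    # continuously write the cleaned-up result into a silver table.
    (spark.readStream.format("delta")
          .load("/delta/events_bronze")
          .where("eventType = 'pageview'")
          .select("timestamp", "userId", "url")
          .writeStream.format("delta")
          .option("checkpointLocation", "/delta/_checkpoints/pageviews_silver")
          .start("/delta/events_silver"))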

Now I want to talk about how people actually move data through Delta Lake, through these different quality classes. One of the patterns I see over and over again is that streaming is actually a very powerful concept here. Before I go deeper into streaming, I want to correct some misconceptions I often hear. One thing people typically think when they hear streaming is that it has to be really fast. It must be very complicated, because you want it to be really fast. And Spark does support that mode of application. There is continuous processing, where you continuously pull new data off the server, holding onto cores, and it supports millisecond latency, but that's actually not the only application where streaming makes sense. Streaming, to me, is really about incremental computation. It's about a query that I want to run continuously as new data arrives. Rather than thinking about this as a bunch of discrete jobs, and putting all the management of those discrete jobs on me or on some workflow engine, with streaming that goes away. You write a query once. You say, "I want to read from the bronze table, I'm gonna do these operations, and write to the silver table," and you just run it continuously. And you don’t have to think about the kind of complicated bits of what data is new, what data has already been processed. How do I process that data and commit it downstream transactionally? How do I checkpoint my state, so that if the job crashes and restarts, I don’t lose my place in the stream? Structured Streaming takes care of all of these concerns for you. And so, rather than being more complicated, I think it can actually simplify your data architecture. And streaming in Apache Spark actually has this really nice kind of cost-latency tradeoff that you can tune, too. So at the far end, you could use continuous processing mode. You can kind of hold onto those cores for streaming persistently, and you can get millisecond latency. In the middle zone, you can use micro-batch. And the nice thing about micro-batch is now you can have many streams on the cluster and they’re time-multiplexing those cores. So you run a really quick job and then you give up that core and then someone else comes in and runs it. And with this, you can get seconds to minutes latency. This is kind of a sweet spot for many people, ’cause it’s very hard to tell if one of your reports is up to date within the last minute, but you do care if it’s up to date within the last hour. And then finally, there’s also this thing called trigger once mode in Structured Streaming. So if you have a job where data only arrives once a day or once a week or once a month, it doesn’t make any sense to have that cluster up and running all the time, especially if you’re running in the cloud where you can give it up and stop paying for it. And Structured Streaming actually has a feature for this use case as well. And it’s called trigger once where basically rather than run the job continuously, anytime new data arrives, you boot it up. You say trigger once. It reads any new data that has arrived, processes it, commits a downstream transaction and shuts down. And so this can give you the benefits of streaming, kind of the ease of coordination, without any of the costs that are traditionally associated with an always running cluster. Now, of course, streams are not the only way to move data through a Delta Lake. Batch jobs are very important as well. Like I mentioned before, you may have GDPR or kind of these corrections that you need to make. You may have changed data capture coming from some other system where you’ve got a set of updates coming from your operational store, and you just want to reflect that within your Delta Lake and for this, we have UPSERTS. And of course, we also support just standard insert, delete, and those kinds of commands as well. And so the really nice thing about Delta Lake is it supports both of these paradigms, and you can use the right tool for the right job. And so, you can kind of seamlessly mix streaming and batch without worrying about correctness or coordination.
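A small, hedged sketch of how the trigger setting changes the cost-latency point of the same query, continuing with the spark session and hypothetical paths from the sketch above:

    events = spark.readStream.format("delta").load("/delta/events_bronze")

    writer = (events.writeStream.format("delta")
                    .option("checkpointLocation", "/delta/_checkpoints/bronze_to_silver"))

    # Micro-batch: keep the query running, seconds-to-minutes latency,
    # time-multiplexing cores with other streams on the cluster.
    # writer.trigger(processingTime="1 minute").start("/delta/events_silver")

    # Trigger once: boot up, read whatever is new since the last run,
    # commit it downstream transactionally, and shut down.
    writer.trigger(once=True).start("/delta/events_silver")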

The final pattern here that I want to talk about is this idea of recomputation. When you have that early table that keeps all of your raw results, and when you have long retention on it, so years of raw data, and when you use streaming to move data between the different nodes of your Delta Lake data graph, it's very easy for you to do recomputation. You might want to recompute because there was a bug in your code, or you might want to recompute because there's something new that you've decided you want to extract. The really nice thing here is that, because of the way streaming works, this is very simple. To give you a mental model of how Structured Streaming works in Apache Spark, our model is basically that a streaming query should always return the same result as a batch query over the same amount of data. What that means is, when you start a new stream against a Delta table, it starts by taking a snapshot of that table at the moment the stream starts. You do this backfill operation, processing all of the data in that snapshot, breaking it into small chunks, checkpointing your state, and committing it downstream. When you get to the end of the snapshot, we switch to tailing the transaction log and only process new data that has arrived since the query started. And that means you get the same result as if you had run the query over everything at the end anyway, but with significantly less work than running it from scratch over and over and over again. So if you want to do recomputation under this model, all you need to do is clear out the downstream table, create a new checkpoint, and start it over. And it will automatically process from the beginning of time and catch up to where we are today.

And that's actually a pretty powerful pattern for correcting mistakes and doing other things. Now that we've covered the high level, I want to talk about some specific use cases where Delta Lake has reduced the cost and eased the management of using Apache Spark on these data lakes. So for Delta Lake, I want to give a little bit of history.

Used by 1000s of organizations worldwide

So Delta Lake is actually two years old. We've had it inside Databricks for the past two years. It was a proprietary solution, and some of our largest customers were using it. I'm going to talk in particular about Comcast, but also Riot Games, and Jam City, and NVIDIA, a bunch of big names you know. They've been using it for several years. At Spark Summit about two months ago, we decided to open source it, so that everybody, even those running elsewhere, can get the power of Delta Lake. So I want to talk about one particular use case that I think is really cool. That's Comcast. Their problem is that they have set-top boxes all over the world, and in order to understand how people interact with their programming, they need to sessionize this information. You watch this TV show, you change the channel, you go over here, you come back to another TV show. And with this they can create better content by understanding how people consume it. And as you can imagine, Comcast has many subscribers, so there’s petabytes of data. And before Delta Lake, they were running this on top of Apache Spark. And the problem was the Spark job to do this sessionization was so big that the Spark job, the Spark scheduler would just tip over. And so, rather than run one job, what they actually had to do was they had to take this one job, partition it by user ID. So they kind of take the user ID, they hash it, they mod it by, I think, by 10. So they break it into kind of 10 different jobs, and then they run each of those jobs independently. And that means that there’s 10x, the overhead, in terms of coordination. You need to make sure those are all running. You need to pay for all of those instances. You need to handle failures and 10 times as many jobs, and that’s pretty complicated. And the really cool story about switching this to Delta was they were able to switch a bunch of these kinds of manual processes to streaming. And they were able to dramatically reduce their costs by bringing this down into one job, running on 1/10 of the hardware. So they’re now computing the same thing, but with 10x less overhead and 10x less cost. And so that’s a pretty kind of powerful thing here that what Delta’s scalable metadata can really bring to Apache Spark. And I’m gonna talk later in the talk exactly how that all works.

But before I get into that, I want to show you exactly how easy it is to get started with Delta Lake if you're already using Apache Spark.

Getting started with Delta using Spark APIs

So getting started is trivial. It's published on Spark Packages. All you need to do to install Delta Lake on your Spark cluster is use Spark Packages. If you're using PySpark, you can just do --packages and then Delta. If you're using the Spark shell, same thing. If you're building a Java or Scala jar and you want to depend on Delta, all you need to do is add a Maven dependency, and then changing your code is very simple. If you're using the Spark SQL DataFrame readers and writers, all you need to do is change the data source from parquet or json or csv or whatever you're using today to delta, and everything else should stay the same. The only difference is that now everything will be scalable and transactional, which, as we saw before, can be very powerful.
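As a quick, minimal sketch of what that looks like with PySpark (the package coordinates and version below are illustrative; check delta.io for the artifact matching your Spark version):

    # Launch PySpark with the Delta Lake package (coordinates are illustrative):
    #   pyspark --packages io.delta:delta-core_2.11:0.3.0

    # Existing parquet-based job:
    events = spark.read.format("parquet").load("/data/events")
    events.write.format("parquet").save("/data/events_clean")

    # The same job on Delta Lake -- only the format string changes:
    events = spark.read.format("delta").load("/data/events")
    events.write.format("delta").save("/data/events_clean")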

Data quality

So everything I've talked about so far has mostly been about these systems problems of correctness. If my job crashes, I don't want it to corrupt the table. If two people write to the table at the same time, I want them both to see consistent snapshots, but data quality is actually more than that. You can write code that runs correctly, yet there can be a bug in that code and you get the wrong answer. So that's why we're extending the notion of data quality to let you declaratively specify quality constraints. This is work coming in the next quarter or so, but the idea here is that we let you, in one place, specify the layout and constraints of your Delta Lake. First, we can specify important things like where the data is stored. You can turn on strict schema checking. Delta Lake has two different modes, and I often see people use both of them as they move through their data quality journey. In the earlier tables, you'll use schema inference; maybe you've just read in a bunch of JSON and put it into Delta Lake exactly as it is. We have nice tools here where we will automatically perform safe schema migrations. So if you’re writing data into Delta Lake, you can flip on the merge schema flag, and it will just automatically add new columns that appear in the data to the table, so that you can just capture everything without spending a bunch of time writing DDL. We, of course, also support kinda standard strict schema checking where you say, create table with the schema, reject any data that doesn’t match that schema, and you can use alter table to change the schema of a table. And often I see this used kind of down the road in kind of the gold level tables where you really want strict enforcement of what’s going in there. And then finally, you can register tables in the Hive Metastore. That support is coming soon, and also put human readable descriptions, so people coming to this table can see things, like this data comes from this source and it’s parsed in this way, and it’s owned by this team. These kind of extra human information that you can use to understand what data will get you the answers you want. And then finally, the feature that I’m most excited about is this notion of expectations. An expectation allows you to take your notion of data quality and actually encode it into the system. So you can say things like, for example, here, I said, I expect that this table is going to have a valid timestamp. And I can say what it means to be a valid timestamp for me and from my organization. So, I expected that the timestamp is there and I expect that it happened after 2012 because my organization started in 2012, and so if you see data from, say, 1970 due to a date parsing error, we know that’s incorrect and we want to reject it. So this is very similar to those of you who are familiar with a traditional database. This sounds a lot like an invariant where you could say not null or other things on a table, but there’s kind of a subtle difference here. The idea of invariants is, you can say things about tables, and if one of those invariants is violated, the transaction will be aborted, will automatically fail. And I think the problem with big data, why invariants alone are not enough, is if you stop processing every single time you see something unexpected, especially in those earlier bronze tables, you’re never going to process anything. And that can really hurt your agility. And so the cool thing about expectations is we actually have a notion of tuneable severity. So we do support this kind of fail stop, which you might want to use on a table that your finance department is consuming ’cause you don’t want them to ever see anything that is incorrect. But we also have these kinds of weaker things where you can just monitor how many records are valid and how many are failing to parse and alert at some threshold. Or even more powerful, we have this notion of data quarantining where you can say any record that doesn’t meet my expectations, don’t fail the pipeline, but also don’t let it go through. Just quarantine it over here in another table, so I can come and look at it later and decide what I need to do to kind of remediate that situation.
So this allows you to continue processing, but without kind of corrupting downstream results with this invalid record. So like I said, this is a feature that we’re actively working on now. Stay tuned to GitHub for more work on it. But I think this kind of fundamentally changes the way that you think about data quality with Apache Spark and with your data lake.
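To make the schema-handling side of that concrete, here is a small, hedged sketch of the two modes in PySpark (the expectations API described above was still being designed at the time of this talk, so it is not shown; the DataFrame names and table paths are hypothetical):

    # Schema-on-read style, for earlier (bronze/silver) tables: let Delta
    # add any new columns that appear in the incoming data.
    (new_events.write.format("delta")
               .mode("append")
               .option("mergeSchema", "true")
               .save("/delta/events_bronze"))

    # Strict enforcement, for gold tables: without the option above, a write
    # whose schema does not match the table's schema is rejected instead of
    # silently changing the table.
    (daily_aggregates.write.format("delta")
                     .mode("append")
                     .save("/delta/business_metrics_gold"))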

Now that I've been at the high level of what Delta is and why you'd care, I want to get into the nitty-gritty details of how Delta actually works. Because it sounds almost too good to be true that we can bring these full ACID transactions to a distributed system like Apache Spark and still maintain good performance.

Delta on disk

First, let's look at what a Delta table looks like when it's actually stored on disk. To start with, it looks like a data lake, which should be pretty familiar. It's just a directory stored in your file system, S3, HDFS, Azure Blob Storage, ADLS. It's just a directory with a bunch of parquet files. And there is one extra, very important piece, which is that we also store a transaction log. Inside the transaction log there are the different table versions. So, I'll talk in a moment about those table versions, but we also still store the data in partition directories. However, that's actually mostly for debugging. There are also modes of Delta where we can work directly with the storage system in the most optimal way. For example, on S3, they recommend that if you're going to write a lot of data on a regular basis, rather than creating date partitions, which create hot spots of temporal locality, you instead partition by a random hash, and because of the power of Delta's metadata, we can do that as well. And then finally, standard data files, which are just normal encoded parquet that can be read by any system out there.
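As a rough picture of that layout (the file names below illustrate the pattern and are not literal output):

    /delta/events/                               <- the table is just a directory
      _delta_log/
        00000000000000000000.json                <- table version 0
        00000000000000000001.json                <- table version 1
        00000000000000000010.checkpoint.parquet  <- periodic checkpoint of the log
      date=2019-01-01/
        part-00000-....snappy.parquet            <- ordinary parquet data files
      date=2019-01-02/
        part-00001-....snappy.parquet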

Table = result of a set of actions

So what's actually in those table versions? How do we think about what the current state of a table is? Each table version has a set of actions that apply to the table and change it in some way. And the current state of the table, at this moment, is the sum of all of those actions. So what kinds of actions am I talking about? For one example, we can change the metadata. So we can say, this is the name of the table. This is the schema of the table. You can add a column to the table or whatever. You can set the partitioning of the table. So one action you can take is change the metadata. The other actions are add a file and remove a file. So we write out a parquet file, and then to actually make it visible in the table, it needs to also be added to the transaction log. And I’ll talk about why that kind of extra level of indirection is a really powerful trick in a moment. And another kind of detail here is when we add files into Delta, we can keep a lot of optional statistics about them. So in some versions we can actually keep the min and max value for every column, which we can use to do data skipping or quickly compute aggregate values over the table. And then finally you can also remove data from the table by removing the file. And again, this is kind of a lazy operation. This level of indirection is really powerful. When we remove a file from the table, we don’t necessarily delete that data immediately, allowing us to do other cool things like time travel. And so the result here of taking all these things is you end up with the current metadata, a list of files, and then also some details, like a list of transactions that have committed, the protocol version for that.
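To make that concrete, each commit file in the log is newline-delimited JSON with one action per line. A simplified, hypothetical entry might look roughly like this (field names abbreviated; the authoritative description is the protocol document in the Delta Lake repository):

    {"metaData": {"id": "...", "schemaString": "...", "partitionColumns": ["date"]}}
    {"add": {"path": "date=2019-01-01/part-00000-....snappy.parquet", "size": 12345,
             "dataChange": true, "stats": "{\"numRecords\": 100, ...}"}}
    {"remove": {"path": "date=2019-01-01/part-00042-....snappy.parquet"}}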

Implementing atomicity

So how does this get us ACID? How do we actually get those nice transactional properties of a database? One detail here is that when we create these table versions, we store them as ordered, atomic units called commits. I talked about that before: we create version zero of the table by creating the file 000000.json. The idea here is that when Delta constructs that file on the file system, we use the underlying atomic primitives. So on S3, to guarantee atomicity, you do an upload where you say up front how many bytes you expect to upload, and unless you actually successfully upload that many bytes, S3 won't accept the write. You're guaranteed to get either the whole file or none of the file. On other systems like Azure or HDFS, what we’ll do is we’ll create a temporary file with the whole contents and then we’ll do an atomic rename, so that the entire file is created or not. So then you can kind of have successive versions. So in version zero, we added these two files. In version one, we removed them and put in a third. So for example, you could be doing compaction here where you atomically take those two files and compact them into one larger file.
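A minimal conceptual sketch of that stage-then-publish commit, using the local file system to stand in for HDFS or Azure (this only illustrates the idea and is not Delta's actual implementation):

    import json
    import os

    def commit(table_path, version, actions):
        """Publish one table version atomically: stage the full contents,
        then make it visible with a primitive that fails if that version
        already exists, which also gives us mutual exclusion."""
        log_dir = os.path.join(table_path, "_delta_log")
        final = os.path.join(log_dir, "%020d.json" % version)
        tmp = final + ".tmp"
        with open(tmp, "w") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
        try:
            # os.link raises FileExistsError if 'final' is already there,
            # so only one writer can ever create a given version.
            os.link(tmp, final)
        except FileExistsError:
            raise RuntimeError("version %d was committed by another writer" % version)
        finally:
            os.remove(tmp)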

Ensuring serializability

Now, another important detail is that we want atomicity for each of these commits, but we also want serializability. We want everyone to agree on the order of changes to this table, so that we can correctly do things like merge in change data capture and other operations that need this property. In order to agree on those changes even with multiple writers, we need a property called mutual exclusion. If two people try to create the same version of a Delta table, only one of them can succeed. To make this clear, user one can write version zero of the table, user two can write version one, but if they both try to write version two, one of them can succeed, and the other one has to get an error message saying, sorry, your transaction didn't go through.

Solving conflicts optimistically

Now you might say, wait a second, if it fails any time two people do something at once, that sounds like I'm going to waste a lot of time and a lot of work. That sounds like a lot of complexity for me. Fortunately, this is where we use a third cool trick called optimistic concurrency. The idea of optimistic concurrency is that when you perform an operation on the table, you optimistically assume that it's going to work. If you get a conflict, you check whether that conflict actually matters to you. If it doesn't, you're allowed to optimistically try again, and in most cases it turns out the transactions don't actually overlap and you can automatically remediate them. So to give you a concrete example, say there are two users, and both of these users are streaming into the same table. When they both start their streaming write, they start by reading the version of the table at that moment. They both read in version zero. They read in the schema of the table. So they make sure that the data that they’re appending has the correct format. And then they write some data files out for the contents of the stream that are gonna be recorded in this batch. And they record what was read and what was written from the table. Now they both try to commit, and in this case, user one wins the race and user two loses. But what user two will do is they’ll check to see if anything has changed. And because the only thing they read from the table was the schema, and the schema has not changed, they’re allowed to automatically try again. And this is all kind of hidden from you as the developer. This all happens automatically under the covers. So they’ll both try to commit, and they’ll both succeed.
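A minimal sketch of that optimistic retry loop, reusing the hypothetical commit() helper from the atomicity sketch above (the table methods used here are also hypothetical stand-ins, not a real Delta API):

    def write_optimistically(table, prepare_actions, max_attempts=10):
        # prepare_actions(snapshot) returns the log actions this transaction
        # wants to commit, based on what it read from the snapshot.
        for _ in range(max_attempts):
            version = table.latest_version()            # 1. record what we read
            snapshot = table.snapshot_at(version)
            actions = prepare_actions(snapshot)         # 2. write data files, build actions
            try:
                commit(table.path, version + 1, actions)  # 3. try to publish the next version
                return version + 1                        # we won the race
            except RuntimeError:
                # Someone else committed that version first. If what they changed
                # does not overlap with what we read (e.g. we only read the schema
                # and the schema is unchanged), loop and try the next version.
                if table.conflicts_with(snapshot, actions):
                    raise
        raise RuntimeError("too many concurrent commits, giving up")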

Handling massive metadata

Now, the final trick we have here is that tables can have massive amounts of metadata. Those of you who have tried to put millions of partitions into the Hive Metastore are probably familiar with this problem. Once you get to those data sizes, the metadata itself can actually be the thing that brings the system down. So we have a trick for this: we already have a distributed processing system for handling large amounts of data. We just use Spark. So we take the transaction log with its actions. We read it with Spark. We can encode it as a checkpoint in parquet. A checkpoint is basically the entire state of the table at some version. When you're reading the transaction log, rather than reading the whole log, you can start from the checkpoint and then apply only the changes that happened after it. And then this itself can be processed with Spark. So when you come to a massive table that has millions of files, and you ask the question like, “How many records were added yesterday?” What we’ll do is we’ll run two different Spark jobs. The first one queries the metadata and says, “Which files are relevant to yesterday?” And it’ll get back that list of files, and then you’ll run another Spark job that actually processes them and does the count. And by doing this in two phases, we can drastically reduce the amount of data that needs to be processed. We’ll only look at the files that are relevant to the query, and we’ll use Spark to do that filtering.
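Conceptually, both phases are ordinary Spark jobs. Delta does this internally when you query a table; the hedged sketch below only illustrates the two-phase idea against the hypothetical layout shown earlier, and assumes the log's partitionValues field is inferred as a struct:

    from pyspark.sql import functions as F

    # Phase 1: a Spark job over the metadata. Read the log, keep the 'add'
    # actions, and filter down to the files whose partition is relevant.
    log = spark.read.json("/delta/events/_delta_log/*.json")
    files = (log.where(F.col("add").isNotNull())
                .select("add.path", "add.partitionValues"))
    relevant = files.where(F.col("partitionValues.date") == "2019-06-30")

    # Phase 2: a Spark job over just those data files.
    paths = ["/delta/events/" + r.path for r in relevant.collect()]
    added_yesterday = spark.read.parquet(*paths).count()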

Roadmap

Before we wrap up and go to questions, I want to talk about the roadmap. Like I said before, this project has been around for a couple of years and was only recently open sourced. We have a very exciting roadmap for the rest of the year. Basically, our goal is for the open source Delta Lake project to be fully API compatible with what's available inside Databricks, so our roadmap for the rest of the quarter is basically open sourcing a lot of the cool features. We actually released version 0.2.0 a few weeks ago, which added support for reading from S3 and reading from Azure Blob Storage and Azure Data Lake. This month, we're planning to do a 0.3.0 release. That adds Scala APIs for UPDATE, DELETE, MERGE, and VACUUM, with Python APIs following shortly after. Then, for the rest of this quarter, we have a few things planned. We want to add full DDL support, that's CREATE TABLE and ALTER TABLE. We also want to give you the ability to store Delta tables in the Hive Metastore, which I think is very important for data discovery across different organizations. And we want to take those DML commands from before, UPDATE, DELETE, and MERGE, and actually hook them into the Spark SQL parser, so you can use standard SQL to do those operations as well. And then moving forward kind of, let us know what you want. So if you’re interested in doing more, I recommend you to check out our website at delta.io, and it has kind of a high level overview of the project. There’s a quick start guide on how you can get started, and it also has links to GitHub where you can watch the progress and see what our roadmap is, and submit your own issues on where you think the project should be going. So I definitely encourage you to do that, but with that, I think we’ll move over to questions. So let me just pull those up and see what we got.

Okay. The first question is, will the materials and the recording be available afterwards? For that one, I'm hoping Denny can let us know. Denny, are you here? - (Denny) No problem at all. Yes, just as a quick housekeeping note, everybody who signed up for the webinar will also be sent the slides and the recording. The process takes roughly 12 to 24 hours to complete, so you should receive that email later today or early tomorrow.

- (Michael) Awesome, thanks very much. Yes, that should be there, so you can take a look at it. There are also videos on YouTube. So stay tuned for more Delta Lake content. Moving on to the other questions. The first one is, does Delta Lake add performance overhead? That's a really interesting question, and I want to break it down. First of all, Delta Lake is designed to be a high-throughput system. For each operation, there's a little bit of overhead in performing it. That's basically because, rather than just writing out the files, we need to write out the files and also write out the transaction log. So that adds a couple of seconds to your Spark job. Now, the important thing here is we designed Delta to be massively parallel and very high throughput. So you get a couple of seconds added to your Spark job, but that is mostly independent of the size of your Spark job. So what Delta Lake is really, really good at is ingesting trillions of records of data or petabytes of data or gigabytes of data. What Delta is not good at is inserting individual records. If you run one Spark job, one record per Spark job, there’ll be a lot of overhead. So kind of the trick here is you want to use Delta in the places where Spark makes the most sense, which are relatively large jobs spread out across lots of machines. And in those cases, the overhead is negligible.

The next question is, because of its ACID properties, will my system be highly available? That's actually a good question that I want to unpack a little. Delta was specifically designed to take advantage of the cloud and its nice properties. For me, there are a couple of nice properties of the cloud. One is that the cloud is very stable. You can put huge amounts of data into S3 and process it at will. It's generally pretty highly available, so you can always read data from S3, no matter where you are. And if you really, really care, there are even things like replication, where you can replicate the data to multiple regions, and Delta plays nicely with that. So reads from a Delta table should be very highly available, because it's just the availability of the underlying storage system. Now, those of you who are familiar with the CAP theorem might be saying, “But wait a second.” So for writes, when we think about consistency, availability, and partition tolerance, Delta chooses consistency. So if you cannot talk to the central coordinator, which, depending on whether you’re on S3, might be kind of your own service that you’re running, or on Azure, where they’ve taken the consistency approach and we use an atomic operation there, the system will pause. But the nice thing here is because of that kind of optimistic concurrency mechanism, that doesn’t necessarily mean you lose that whole job that you might’ve been running for hours. It just means you’ll have to wait until you’re able to talk to that service. So I would say in terms of reads, very highly available, in terms of writes, we choose consistency, but in general, that actually still works out pretty well.

The next one is, do you retain all levels of data? Well, I want to clarify the idea behind bronze, silver, and gold. Not everybody keeps the raw data around. Not everybody keeps all of the data. You might have a retention requirement that says you're only allowed to keep two years of data. So, I think it's up to you to decide which data makes sense to hold onto. The only thing I would say is that the nice thing about data lakes, and how Delta applies to them in general, is that you have the option of keeping the raw data, and as much of it as you want. There's no technical limitation keeping you from retaining all of the data, and as a result, many of the organizations I work with actually do keep everything they're legally allowed to keep for a very long time. They only delete it when they have to get rid of it.

The next question is, where do you write the logic? Can we write the logic in Scala? Delta Lake plugs into all of the existing APIs of Apache Spark, and that means you can use any of them. If you're a Scala programmer, you can use Scala. If you're a Java programmer, that works. We also have bindings in Python, and if you're an analyst and you don't want to program at all, we also support pure SQL. The idea here is really that the underlying engine is written in Scala, and Delta is also written in Scala, but your logic can be written in whatever language you're comfortable with. This is another case where I think you want the right tool for the right job. Personally, I do a lot of my stuff in Scala, but when I need to make graphs, I switch over to Python and use the platform there. But Delta still gives me the ability to filter through massive amounts of data, bring it down to something that will fit in pandas, and then I do some plotting.

The next question is, is Presto a part of Delta Lake, or is it only Spark? That's a good question, and it's evolving pretty quickly right now, so there are two different answers. I'll tell you where we are and where we're going. Right now, there's a feature inside Databricks, which we're committed to open sourcing, that allows Delta writers to write out these things called manifest files, which let you query a Delta table in a consistent way from Presto or Athena or any other Presto-based system. However, we are also working with the company behind Presto to build a native connector soon. We also have active interest from the Hive community and the Scalding community, so there's a lot of interest in building connectors. So today, the core of Delta is built on Spark, but I think the really powerful thing about open source and open standards is that anyone can integrate with it. And, for this project, we're committed to growing the ecosystem and making it usable by everyone. So if you’re a committer on one of those projects, please join our mailing list, join our Slack channel, check it out, and let us know how we can help you build these additional connectors.

Next question, can we try Delta Lake with Databricks Community Edition? Yes, you can. Delta Lake is in Community Edition, so check it out. Everything should be there. Let us know what you think.

The next question is, can a Delta table be queried with Hive? Yes, so basically the same answer as for Presto. There's active interest in the community in building that support. It doesn't exist today, but it's definitely something we want to build. Next question, how does Delta Lake handle slowly changing dimensions from raw to gold?

Yeah, great, that's a good question, and there's actually a blog post about it at www.eheci.com. If you Google slowly changing dimensions, Delta, it will walk you through all of the details, but I think the real answer here is that with the MERGE operator, plus the power of Spark, it's actually pretty easy to build all the different types of slowly changing dimensions. And the magical thing that Delta adds to Spark is those transactions. Modifying a table would be very dangerous without transactions, and Delta makes that possible, which in turn enables this type of use case.
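As a rough illustration of the kind of MERGE involved (the table and column names below are hypothetical, and the sketch assumes an environment where Delta's SQL MERGE is available, which at the time of this talk meant Databricks):

    # Upsert a batch of dimension updates into a Delta table (SCD Type 1 style).
    spark.sql("""
      MERGE INTO customers AS target
      USING customer_updates AS source
      ON target.customer_id = source.customer_id
      WHEN MATCHED THEN
        UPDATE SET target.address = source.address
      WHEN NOT MATCHED THEN
        INSERT (customer_id, address) VALUES (source.customer_id, source.address)
    """)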

Next, we mostly work with Azure. We'd like to know whether Delta Lake behaves any differently when running on Azure Event Hubs instead of Kafka. Yeah, I'll answer that question a little more generally. One of the powerful things about Delta, as I talked about, is being integrated with Spark. A big reason for that is I use Spark as the narrow waist of the big data ecosystem. There are Spark connectors for almost every big data system in the world. So if Spark can read from it, it works with Delta Lake. Event Hubs, in particular, has both a native connector that plugs in through the Spark data source API and a Kafka API that works with Spark's Kafka support. So you can easily read from Event Hubs and do all the things I talked about today using Event Hubs instead of Kafka. And that applies to any system that Spark can read from.

In general, just to answer the Azure part a little more, Delta is fully supported on Azure, including ADLS. We recently improved support for ADLS Gen 2. It's available for you to download, and it's also part of Azure Databricks.

Yeah, so the next question is, what exactly will the Scala API for the UPDATE DML command look like? Will it look like Spark SQL, where you're in Spark SQL and you pass in a string, UPDATE? The answer is, we'll support both. If you go to the GitHub repository, I believe that code has already been merged, so you can see the Scala API, and if not, there's a design doc on the ticket about adding UPDATE that goes into the details. But the idea here is that there will be both a Scala function called update that you can use programmatically, without having to do string interpolation, and also a SQL way to do it. So you can construct a SQL string and pass that in. So, it's kind of like, you use whatever language you're most familiar with that's already part of your toolkit, and Delta should work with that automatically.
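For reference, the programmatic form being added at the time looked roughly like the following Python sketch (check the Delta Lake documentation for the exact signatures in the release you end up using; paths and column names are hypothetical):

    from delta.tables import DeltaTable
    from pyspark.sql.functions import expr

    events = DeltaTable.forPath(spark, "/delta/events")

    # Programmatic form: no string interpolation of values into SQL.
    events.update(
        condition = expr("eventType = 'clck'"),
        set = {"eventType": expr("'click'")})

    # SQL form (exact spelling varies by release):
    spark.sql("UPDATE delta.`/delta/events` SET eventType = 'click' WHERE eventType = 'clck'")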

The next question is, does Delta Lake work with HDFS? Yes, it works completely with HDFS. HDFS has all of the primitives we need, so you don't need anything extra; specifically, HDFS supports an atomic rename that fails if the destination already exists. As long as you're running a reasonably new version of HDFS, and this isn't even that new, it should just work automatically. If you look at the getting started guide in the documentation at delta.io, it lists all the different storage systems we support and the details of what you need to do to set them up.

Next question, are updates and deletes at the single-row or record level? There are two answers. Yes, Delta allows you to do fine-grained, individual-row updates. So you don't necessarily have to do your updates or deletes at the partition level. If they are at the partition level, that does matter. If you delete, for example, at the partition level, it's much more efficient, because we can just drop the metadata. We don't have to do any manual modification. But if they're not at the partition level, if you're doing a fine-grained single-row update or delete, what we'll do is find the relevant parquet files, rewrite them, commit the adds and removes that make that happen, and then it's transactional. So it is supported, but it does involve rewriting individual files. So the thing I'd say here is, Delta is definitely not designed to be an OLTP system. You should not use it if you have lots of individual row updates, but we do support that fine granularity use case.

Do you know when the Scala APIs for Delta Lake will be available? Well, there are a couple of answers. Reading from and writing to Delta Lake in Scala, for both streaming and batch, already works today. It's available now. If you're specifically talking about UPDATE, DELETE, and MERGE, I believe most of that code has already been put into the repository. If you download it and build it yourself, it's there. We're hoping to do a release in July. Hopefully this month there will be the next release, which includes those extra Scala APIs.

Let's see.

Yeah, so the next question is about data quality. Can we have any other fields used for validation purposes besides a timestamp? Yes, the expectations we talked about before are just general SQL expressions. So any expectation you can encode in SQL is allowed. In that example it was a very simple comparison against a specific date, but it could be anything you want. It could even be a UDF that checks the quality of the data. The important thing here is that we let you put those in as properties of your data flow, rather than manual validations you have to remember to do yourself. That way they're enforced globally for anybody using the system.

Does Delta Lake support merging from a DataFrame instead of a temp table? Yes, once the Scala and Python APIs are available, you'll be able to pass in a DataFrame. Today, inside Databricks, only the SQL DML is available, and in that case you need to register a temp table. But like I said, stay tuned for the end of the month. We'll have a release with the Scala APIs, and then you'll be able to pass the DataFrame in yourself.

I've seen this question a few times, so I'll just answer it once. We support both ADLS Gen 1 and Gen 2, although Gen 2 will be faster because we have some extra optimizations there.

The next one is, in the checkpointing example, does the Spark job compute Delta Lake checkpoints internally, or do they need to be hand-written? That's a really good question. When you're using streaming to read from or write to a Delta table, or both, if you're just using it between two different Delta tables, checkpointing is handled by Structured Streaming. So you don't need to do any extra work to construct those checkpoints. It's built into the engine. The way Structured Streaming works in Spark is that for every source and sink there's a contract that allows us to do that automatic checkpointing. The source needs to be able to say, I'm processing the data from here to here, and those notions of places in a stream, which we call offsets, need to be serializable. We store those in the checkpoint. We basically use the checkpoint as a write-ahead log. So we say, batch number 10 is going to be this data. Then we attempt to process batch number 10, then we write it to the sink, and the guarantee here is the sink must be idempotent. So it must only accept batch number 10 once, and if we try to write it twice due to a failure, it must reject that and kind of just skip over it. And by putting all of these kind of constraints together, you actually get exactly once processing with automatic checkpointing without you needing to do any extra work.

Great question. Why not use polyglot persistence and use an RDBMS to store the ACID transactions? That's a great question, and we actually tried that. In fact, one of the early versions of Delta used MySQL, and the problem there is that MySQL is a single machine, so just getting the list of files out for a large table can become the bottleneck. Whereas when you store the metadata in a form that Spark itself can natively process, you can leverage Spark to do that processing. That said, there's nothing stopping you from implementing the Delta transaction protocol on top of another storage system. In fact, there's a pretty long conversation going on right now in the GitHub repository, a back and forth about what it would take to build a database-backed version of Delta, and that's certainly possible, but in our original scalability tests we found Spark was the fastest way to do it, at least among the systems we tested, and that's why we decided to do it this way.

另一個問題,這是否意味著我們不需要數據幀和可以做所有轉換三角洲湖呢?我說不。我認為你隻能使用更新,刪除和合並不使用任何類型的實際的數據幀代碼。您可以使用純SQL,但實際上,我認為這是一種合適的工具做合適的工作。三角洲湖深深地集成引發數據幀。,就我個人而言,我覺得這是一個非常強大的工具進行轉換。它有點像SQL + +因為你有這些關係的概念,但嵌入在一個完整的編程語言。其實我覺得可以是一個非常有效的方式寫你的數據管道。

How does Delta Lake handle newer versions of Spark? Yeah, so Delta Lake requires Spark 2.4.3, which is a very recent release. That's because there were bugs in earlier versions of Spark that prevented data sources from plugging in correctly. In general, we're working on Spark compatibility. One of our core projects this quarter is to make sure that everything in Delta plugs into nice, public, stable APIs of Spark, so that we can work with multiple versions going forward.

One question, does Delta Lake support ORC? Yes, that's a good question that I get a lot. Again, there's a discussion on GitHub about adding that support. I encourage you to go check it out and vote on that issue if it's something that's important to you. There are two answers to this question. One is about the Delta Lake transaction protocol. The transaction log actually does support specifying the format of the data that's stored. So it could actually be used with any different file format: txt, JSON, CSV. That's built into the protocol already. Today, we don't expose that as an option. When you’re creating a Delta table, we only do parquet. And the reason for that is pretty simple. I just think less tuning knobs is generally better, but for something like ORC, if there’s a good reason why your organization can switch, I think that support would be really, really easy to add and that’s something that we’re discussing in the community. So please go over to GitHub, find that issue, and fill it in. And then I’m going to take one final question since we’re getting close to time. And the question here is, what is the difference between the Delta Lake that’s included with Databricks versus the open source version? And that’s a question I get a lot. And I think, the way to think about this is I’d like to kind of talk about what my philosophy is behind open source. And that is that I think APIs in general need to be open. So any program you can run correctly inside of Databricks should also work in open source. Now that’s not entirely true today because Delta Lake is only, the open source version of Delta Lake is only two months old. And so what we’re doing is we are working hard to open source all of the different APIs that exist. So update, delete, merge, history, all of those kinds of things that you can do inside of Databricks will also be available in the open source version. Managed Delta Lake is the version that we provide. It’s gonna be easier to set up. It’s gonna integrate with all of the other pieces of Databricks. So we do caching, we have a kind of significantly faster version of Spark, and so that runs much faster, but in terms of capabilities, our goal is for there to be kind of complete feature parity here ’cause we’re kinda committed to making this open source project successful. I think open APIs is the correct way to do that. So with that, I think we’ll end it. Thank you very much for joining me today. And please check out the website, join the mailing list…

Advanced: Diving into Delta Lake

Dive through the internals of Delta Lake, a popular open source technology enabling ACID transactions, time travel, schema enforcement and more on top of your data lakes.

Watch now
