Using DSL for Spark Dataset

Zhongkai_Liu · February 13, 2022, 8:59pm

If I always use DSL instead of lambda for Spark Dataset, would I be able to completely mitigate the performance effect of using Dataset compared to using DataFrame?

Course: https://www.educative.io/collection/5352985413550080/6639962691731456
Lesson: https://www.educative.io/collection/page/5352985413550080/6639962691731456/6696332884967424

Abdul_Mateen · February 17, 2022, 12:42pm

Hello @Zhongkai_Liu

Yes, you are right: if you use the DSL instead of lambdas with a Spark Dataset, you would be able to completely mitigate the performance effect of using a Dataset compared to using a DataFrame.

But in general, Spark DataFrames are faster than Spark Datasets.
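
To make the two styles concrete, here is a minimal sketch in Scala. The `Person` case class, the sample rows, and the object name are assumptions made up for illustration; they are not from the lesson. The lambda version passes an opaque function that Spark has to run against deserialized `Person` objects, while the DSL version passes a column expression that the Catalyst optimizer can inspect. Calling `explain()` on each lets you compare the plans on your own Spark version.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object DslVsLambda {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; adjust master/appName as needed.
    val spark = SparkSession.builder()
      .appName("dsl-vs-lambda")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] =
      Seq(Person("Ann", 34), Person("Bob", 17)).toDS()

    // Lambda style: the predicate is an arbitrary Scala function, so Spark
    // must turn each row into a Person object before it can evaluate it.
    val adultsLambda: Dataset[Person] =
      people.filter((p: Person) => p.age >= 18)

    // DSL style: the predicate is a Column expression that the optimizer
    // can analyze, just as it would for a DataFrame.
    val adultsDsl: Dataset[Person] =
      people.filter($"age" >= 18)

    // Print both physical plans to compare them.
    adultsLambda.explain()
    adultsDsl.explain()

    spark.stop()
  }
}
```

Both calls return a `Dataset[Person]`, so the result stays typed in either style. Note that the column name in `$"age"` is a string and is only checked when the query is analyzed at runtime, not by the Scala compiler.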

I hope I have answered your query; please let me know if you still have any confusion.

Thank You

Zhongkai_Liu · February 21, 2022, 2:06am

Thank you @Abdul_Mateen for your answer.

If using the DSL can completely mitigate the performance effect, why would we still say DataFrames are generally faster? Are there other things that cause Datasets to be slower?

Currently, we mainly use DataFrames in our codebase. However, I have been considering using more Datasets, as I prefer strong typing and compile-time checks. But I don’t want to affect the performance too much. I am okay with always using the DSL instead of lambdas, since we are already using the DSL for DataFrames.