Why create a view in Spark?

Y_C · March 23, 2022, 7:24am

“To work through this process in PySpark, we’ll load the stats dataset into a dataframe, expose it as a view, and calculate the summary statistics.”
Why is a view needed?
Also, in this chapter we read the CSV file into a data frame. Does that mean it will not use lazy loading?

Course: https://www.educative.io/collection/10370001/6068402050301952
Lesson: https://www.educative.io/collection/page/10370001/6068402050301952/6308607927779328

Ammar_Ahmad_Farid · March 27, 2022, 8:30pm

Hi @Y_C ,

We read the dataset from a CSV file and then expose it as a view because, as we know, a view is the result set of a stored query on the data. So whenever we need to show the specific data retrieved by the query
“select player_id, sum(1) as games, sum(goals) as goals
from stats
group by 1
order by 3 desc
limit 5”
we only have to call the view with “display(view_name)” rather than fetching the data from the table again by running the same query.
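
For reference, here is a minimal sketch of how the view and the query quoted above fit together in PySpark. It assumes a SparkSession is already available as `spark` (as it is in a hosted notebook) and that the CSV has been loaded into a DataFrame. The names `top_scorers`, `stats_df` and `TOP_SCORERS_SQL` are made up for illustration and are not taken from the lesson. One point worth keeping in mind: in Spark, a temporary view is a name that lets SQL refer to the DataFrame, and it does not store a query result by itself.

```python
# The query quoted above, run against a temporary view named "stats".
TOP_SCORERS_SQL = """
    select player_id, sum(1) as games, sum(goals) as goals
    from stats
    group by 1
    order by 3 desc
    limit 5
"""


def top_scorers(spark, stats_df):
    """Register stats_df as the temp view "stats" and run the query on it.

    spark    -- an existing SparkSession (assumed to be provided by the notebook)
    stats_df -- the DataFrame loaded from the stats CSV
    """
    # Registering the view only gives the DataFrame a name that SQL can see;
    # it does not copy the data or store any query result.
    stats_df.createOrReplaceTempView("stats")

    # spark.sql returns another DataFrame; nothing is computed until an
    # action such as show() or collect() is called on it.
    return spark.sql(TOP_SCORERS_SQL)
```

Calling `top_scorers(spark, stats_df)` gives back a DataFrame, so the result can be passed to `display(...)` or have `.show()` called on it like any other DataFrame.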

Y_C · March 28, 2022, 10:37am

The data frame was already loaded in memory, so why do we need to create a view on top of it?

Ammar_Ahmad_Farid · March 28, 2022, 11:17am

Yes, you are right, but the data frame contains the whole dataset. We only need the data retrieved from the “stats” table.

Y_C · March 28, 2022, 11:20am

Okay, thanks for the explanation. But why do we use read csv in this case instead of another format, so that we could utilise lazy loading?

Ammar_Ahmad_Farid · March 28, 2022, 11:36am

The file we are reading is already in CSV format.
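
On the lazy loading question, a small sketch may help. As far as Spark's execution model goes, lazy evaluation does not depend on the file format: reading a CSV builds a query plan, and the rows are only read when an action runs. The sketch assumes an existing SparkSession passed in as `spark`, a hypothetical file path, and a `player_id` column as in the query earlier in the thread. The function names are made up for illustration.

```python
def load_stats(spark, csv_path):
    """Return a DataFrame for the CSV at csv_path without reading its rows yet."""
    # header=True takes the column names from the first line of the file.
    # Schema inference is left off because inferSchema=True would make Spark
    # scan the data up front just to work out the column types.
    return spark.read.csv(csv_path, header=True)


def count_rows(stats_df):
    """Run an action on the DataFrame and return the number of rows."""
    # select() is a transformation and only extends the plan;
    # count() is an action, so this is where the file is actually read.
    return stats_df.select("player_id").count()
```

So with `stats_df = load_stats(spark, "stats.csv")`, apart from a look at the header for column names, the data is not read until something like `count_rows(stats_df)` is called.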