educative.io

Why create a view in Spark?

“To work through this process in PySpark, we’ll load the stats dataset into a dataframe, expose it as a view, and calculate the summary statistics.”
Why is a view needed?
Also, in this chapter we read the CSV file into a data frame. Does that mean it will not use lazy loading?



Course: https://www.educative.io/collection/10370001/6068402050301952
Lesson: https://www.educative.io/collection/page/10370001/6068402050301952/6308607927779328

Hi @Y_C ,

We read the dataset from a CSV file and then expose it as a view because a SQL query needs a name to refer to the data. In Spark, a temporary view is a name registered for a data frame, not a stored copy of a query's results. Once the data frame is registered as “stats”, we can run the query

“select player_id, sum(1) as games, sum(goals) as goals
from stats
group by 1
order by 3 desc
limit 5”

against it. The query returns a new data frame, which we can show with “display(...)”. Note that the view itself does not cache anything: the query is evaluated each time its result is requested.

The data frame was already loaded in memory, so why do we need to create a view on top of it?

Yes, you are right, but the data frame contains the whole dataset. We only need the data that the query retrieves from the “stats” view.

Okay, thanks for the explanation. But why do we use read csv in this case instead of another format, so that we could utilise lazy loading?

The file we are reading is already in csv format.