
Pandas DataFrame vs Spark DataFrame

OJJ 2023. 3. 16. 11:13

A DataFrame represents a table of data with rows and columns. The DataFrame concept is the same in every programming language, but the Spark DataFrame and the Pandas DataFrame are quite different. This post looks at the differences between the two.

Pandas DataFrame

Pandas is an open-source Python library built on top of NumPy. It is a Python package for manipulating numerical data and time series using a variety of data structures and operations, and it is mainly used to make importing and analyzing data considerably easier. A Pandas DataFrame is a potentially heterogeneous, two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns). Data, rows, and columns are the three main components of a Pandas DataFrame.

Advantages:

  • A Pandas DataFrame supports data manipulation such as indexing, renaming, sorting, and merging DataFrames.
  • Columns can be updated, added, and deleted with Pandas.
  • Pandas DataFrame supports multiple file formats.
  • Processing time is too high because of the built-in functions. (-> Is this really an advantage??)
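
The manipulations listed above can be sketched roughly as follows. This is a minimal illustration, assuming pandas is installed; the `orders` and `customers` tables and their column names are invented for this example and do not come from the original article.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
orders = pd.DataFrame(
    {"order_id": [3, 1, 2], "cust": ["b", "a", "b"], "amount": [30.0, 10.0, 20.0]}
)
customers = pd.DataFrame({"customer": ["a", "b"], "city": ["Seoul", "Busan"]})

# Rename a column, then sort the rows by order_id.
orders = orders.rename(columns={"cust": "customer"}).sort_values("order_id")

# Label-based indexing: keep rows whose amount exceeds 15, and two columns.
large = orders.loc[orders["amount"] > 15, ["order_id", "amount"]]

# Merge two DataFrames on a shared key column.
merged = orders.merge(customers, on="customer", how="left")

# Add a derived column, update it, then delete another column.
merged["amount_with_tax"] = merged["amount"] * 1.1
merged["amount_with_tax"] = merged["amount_with_tax"].round(2)
merged = merged.drop(columns=["amount"])

print(merged)
```

Each step returns or updates an ordinary in-memory object, which is convenient for small data but is also why everything has to fit on a single machine.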

Disadvantages:

  • Manipulation becomes complex with huge datasets.
  • Processing can be slow during manipulation.

 

Spark DataFrame

Spark is a cluster computing system. It is faster than other cluster computing systems such as Hadoop, and writing parallel jobs in it is simple. Spark is currently the most active Apache project and is used to process a large number of datasets. It is written in Scala and provides high-level APIs in Python, Scala, Java, and R. In Spark, a DataFrame is a distributed collection of data organized into rows and columns. Each column in a DataFrame has a name and a type.

Advantages:

  • Spark provides an easy-to-use API for operating on large datasets.
  • It supports not only 'map' and 'reduce' but also machine learning (ML), graph algorithms, streaming data, SQL queries, and more.
  • Spark uses in-memory (RAM) computation.
  • It offers 80 high-level operators for developing parallel applications.
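
To make the 'map' and 'reduce' idea concrete, here is a small conceptual sketch in plain Python. It is not Spark's API: the `partition` and `map_partition` helpers are hypothetical names made up for this example, and a thread pool stands in for the worker nodes of a cluster only to show the shape of the computation.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(data, n):
    """Split data into n chunks (stand-ins for the partitions of a cluster)."""
    return [data[i::n] for i in range(n)]


def map_partition(chunk):
    """Map step: square each value, then pre-aggregate inside the partition."""
    return sum(x * x for x in chunk)


values = list(range(1, 11))
chunks = partition(values, 4)

# Each partition is independent of the others, so the map step can be
# handed to separate workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_partition, chunks))

# Reduce step: combine the per-partition results into a single answer.
total = reduce(lambda a, b: a + b, partials)
print(total)  # 385
```

Because no partition depends on another, the same pattern can in principle be spread across many machines, which is the property a distributed engine exploits.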

Disadvantages:

  • No automatic optimization process
  • Very few algorithms
  • Small files issue

 

Table of differences between Spark DataFrame and Pandas DataFrame:

Spark DataFrame | Pandas DataFrame
Spark DataFrame supports parallelization. | Pandas DataFrame does not support parallelization.
Spark DataFrame runs on multiple nodes. | Pandas DataFrame runs on a single node.
It follows lazy execution, meaning a task is not executed until an action is performed. | It follows eager execution, meaning a task is executed immediately.
Spark DataFrame is immutable. | Pandas DataFrame is mutable.
Complex operations are harder to perform than with a Pandas DataFrame. | Complex operations are easier to perform than with a Spark DataFrame.
Spark DataFrame is distributed, so processing a large amount of data is faster. | Pandas DataFrame is not distributed, so processing a large amount of data is slower.
sparkDataFrame.count() returns the number of rows. | pandasDataFrame.count() returns the number of non-NA/null observations for each column.
Spark DataFrames are well suited to building scalable applications. | Pandas DataFrames cannot be used to build scalable applications.
Spark DataFrame assures fault tolerance. | Pandas DataFrame does not assure fault tolerance; we need to implement our own framework to assure it.
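
Two rows of the table, the meaning of count() and mutability, can be checked on the Pandas side with a short sketch. It assumes pandas is installed; the `df` table and its values are invented for this example.

```python
import pandas as pd

# Hypothetical data with one missing value in the "score" column.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, None, 3.0]})

# pandas count() skips NA/null values and reports one number per column.
per_column = df.count()
print(per_column)

# The number of rows, which the table says count() returns on a
# Spark DataFrame, is obtained in pandas with len().
n_rows = len(df)
print(n_rows)

# Pandas DataFrames are mutable: this assignment changes df itself.
df.loc[1, "score"] = 2.0
print(df.count())
```

Code ported between the two libraries should therefore not assume that count() means the same thing in both.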

Choosing Between Pandas and Spark

Let's look at a few advantages of using PySpark over Pandas.

  • With large datasets Pandas can become slow, whereas Spark has a built-in API for operating on data that makes it faster than Pandas.
  • Spark has an easy-to-use API and is easier to implement with than Pandas.
  • Spark supports Python, Scala, Java, and R.
  • Spark offers ANSI SQL compatibility.
  • Spark uses in-memory (RAM) computation.

 


(Original text)

A DataFrame represents a table of data with rows and columns. The DataFrame concept never changes between programming languages; however, the Spark DataFrame and the Pandas DataFrame are quite different. In this article, we are going to see the differences between the Spark DataFrame and the Pandas DataFrame.

Pandas DataFrame

Pandas is an open-source Python library based on the NumPy library. It’s a Python package that lets you manipulate numerical data and time series using a variety of data structures and operations. It is primarily used to make data import and analysis considerably easier. Pandas DataFrame is a potentially heterogeneous two-dimensional size-mutable tabular data structure with labeled axes (rows and columns). The data, rows, and columns are the three main components of a Pandas DataFrame.

Advantages:

  • Pandas DataFrame supports data manipulation such as indexing, renaming, sorting, and merging DataFrames.
  • Updating, adding, and deleting columns is quite easy using Pandas.
  • Pandas DataFrame supports multiple file formats.
  • Processing time is too high due to the inbuilt functions.

Disadvantages:

  • Manipulation becomes complex when we use a huge dataset.
  • Processing time can be slow during manipulation.

Spark DataFrame

Spark is a system for cluster computing. When compared to other cluster computing systems (such as Hadoop), it is faster. It has Python, Scala, and Java high-level APIs. In Spark, writing parallel jobs is simple. Spark is the most active Apache project at the moment, processing a large number of datasets. Spark is written in Scala and provides API in Python, Scala, Java, and R. In Spark, DataFrames are distributed data collections that are organized into rows and columns. Each column in a DataFrame is given a name and a type.

Advantages:

  • Spark provides an easy-to-use API for operating on large datasets.
  • It supports not only 'map' and 'reduce' but also machine learning (ML), graph algorithms, streaming data, SQL queries, etc.
  • Spark uses in-memory (RAM) computation.
  • It offers 80 high-level operators to develop parallel applications.

Disadvantages:

  • No automatic optimization process
  • Very few Algorithms.
  • Small Files Issue

Table of differences between Spark DataFrame and Pandas DataFrame:

Spark DataFrame | Pandas DataFrame
Spark DataFrame supports parallelization. | Pandas DataFrame does not support parallelization.
Spark DataFrame has multiple nodes. | Pandas DataFrame has a single node.
It follows lazy execution, which means that a task is not executed until an action is performed. | It follows eager execution, which means a task is executed immediately.
Spark DataFrame is immutable. | Pandas DataFrame is mutable.
Complex operations are difficult to perform as compared to Pandas DataFrame. | Complex operations are easier to perform as compared to Spark DataFrame.
Spark DataFrame is distributed and hence processing in the Spark DataFrame is faster for a large amount of data. | Pandas DataFrame is not distributed and hence processing in the Pandas DataFrame will be slower for a large amount of data.
sparkDataFrame.count() returns the number of rows. | pandasDataFrame.count() returns the number of non-NA/null observations for each column.
Spark DataFrames are excellent for building a scalable application. | Pandas DataFrames can't be used to build a scalable application.
Spark DataFrame assures fault tolerance. | Pandas DataFrame does not assure fault tolerance. We need to implement our own framework to assure it.

Deciding Between Pandas and Spark

Let's see a few advantages of using PySpark over Pandas.

 
  • When we work with a huge amount of data, Pandas can be slow to operate, but Spark has an inbuilt API to operate on data, which makes it faster than Pandas.
  • Spark has an easy-to-use API and is easier to implement with than Pandas.
  • Spark supports Python, Scala, Java, and R.
  • ANSI SQL compatibility in Spark.
  • Spark uses in-memory (RAM) computation.

 

* Source: https://www.geeksforgeeks.org/difference-between-spark-dataframe-and-pandas-dataframe/
