PySpark vs Pandas

Spark and Pandas DataFrames are very similar.
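
The snippets below assume pandas is imported as pd and a SparkSession is available as spark; a minimal sketch of that setup (the app name is arbitrary):

import pandas as pd
from pyspark.sql import SparkSession

# assumed setup for the snippets below
spark = SparkSession.builder.appName("pyspark-vs-pandas").getOrCreate()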

# Pandas

# load data
df = pd.read_csv("mtcars.csv")

# view dataframe
df 
df.head(10)

# columns and data types
df.columns
df.dtypes

# rename columns
df.columns = ['a', 'b', 'c']
df.rename(columns = {'old': 'new'})

# drop column
df.drop('mpg', axis=1)

# filtering
df[df.mpg < 20]
df[(df.mpg < 20) & (df.cyl == 6)]

# add column
df['gpm'] = 1 / df.mpg # division by 0 gives inf

# fill nulls
df.fillna(0) # more options than PySpark has

# aggregation
df.groupby(['cyl', 'gear']) \
  .agg({'mpg':'mean', 'disp':'min'})

# PySpark

# load data
df = spark.read \
  .options(header=True, inferSchema=True) \
  .csv("mtcars.csv")

# view dataframe
df.show() # evaluating df alone only shows the schema; show() prints rows
df.show(10)

# columns and data types
df.columns
df.dtypes

# rename columns
df.toDF('a', 'b', 'c')
df.withColumnRenamed('old', 'new')

# drop column
df.drop('mpg')

# filtering
df[df.mpg < 20]
df[(df.mpg < 20) & (df.cyl == 6)]

# add column
df.withColumn('gpm', 1 / df.mpg) # division by 0 gives null

# fill nulls
df.fillna(0)

# aggregation
df.groupby(['cyl', 'gear']) \
  .agg({'mpg':'mean', 'disp':'min'})

Okay, we get the point. Now let's look at the places where the two APIs differ a bit more.

# Pandas

# STANDARD TRANSFORMATIONS
# uses python numpy lib
import numpy as np
df['logdisp'] = np.log(df.disp)

# ROW CONDITIONAL STATEMENTS
df['cond'] = df.apply(lambda r:
  1 if r.mpg > 20 else 2 if r.cyl == 6 else 3,
  axis = 1)

# PYTHON WHEN REQUIRED
df['disp1'] = df.disp.apply(lambda x: x+1)

# MERGE/JOIN DATAFRAMES
left.merge(right, on='key')
left.merge(right, left_on='a', right_on='b')

# PIVOT TABLE
pd.pivot_table(df, values='D', \
  index=['A', 'B'], columns=['C'], \
  aggfunc=np.sum)

# SUMMARY STATISTICS
df.describe()

# HISTOGRAM
df.hist()

# SQL
n/a

# PySpark

# STANDARD TRANSFORMATIONS
# uses built-in functions 
import pyspark.sql.functions as F
df.withColumn('logdisp', F.log(df.disp))

# ROW CONDITIONAL STATEMENTS
df.withColumn('cond', \
  F.when(df.mpg > 20, 1) \
   .when(df.cyl == 6, 2) \
   .otherwise(3))
   
# PYTHON WHEN REQUIRED
from pyspark.sql.types import DoubleType
fn = F.udf(lambda x: x+1, DoubleType())
df.withColumn('disp1', fn(df.disp))

# MERGE/JOIN DATAFRAMES
left.join(right, on='key')
left.join(right, left.a == right.b)

# PIVOT TABLE
df.groupBy("A", "B").pivot("C").sum("D")

# SUMMARY STATISTICS
df.describe().show() # only count, mean, stddev, min, max

# HISTOGRAM
df.sample(False, 0.1).toPandas().hist()

# SQL
df.createOrReplaceTempView('foo')
df2 = spark.sql('select * from foo')

Queue

  • Queues can be implemented with either an array or a linked list (with a tail pointer)
  • Each queue operation takes O(1): Enqueue, Dequeue, Empty
  • FIFO (First In First Out)

Operations

  • Enqueue(Key): adds key to collection
  • Key Dequeue(): removes and returns least recently added key
  • Boolean Empty(): returns whether the queue is empty
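
A minimal sketch of these queue operations in Python, using collections.deque as the backing structure (the class and method names simply mirror the list above):

from collections import deque

class Queue:
    def __init__(self):
        self._items = deque()

    def enqueue(self, key):   # adds key to the collection
        self._items.append(key)

    def dequeue(self):        # removes and returns the least recently added key
        return self._items.popleft()

    def empty(self):          # True if no elements exist
        return len(self._items) == 0

q = Queue()
q.enqueue('a'); q.enqueue('b')
print(q.dequeue())  # 'a' (FIFO)
print(q.empty())    # False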

Stack

  • Stacks can be implemented with either an array or a linked list
  • Each stack operation takes O(1): Push, Top, Pop, Empty
  • LIFO (Last In First Out)

Operations

  • Push(Key): adds key to collection
  • Key Top(): returns most recently added key
  • Key Pop(): removes and returns most recently added key
  • Boolean Empty(): returns whether the stack is empty
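
A corresponding sketch of the stack operations in Python, backed by a plain list (again, the names mirror the list above):

class Stack:
    def __init__(self):
        self._items = []

    def push(self, key):   # adds key to the collection
        self._items.append(key)

    def top(self):         # returns the most recently added key
        return self._items[-1]

    def pop(self):         # removes and returns the most recently added key
        return self._items.pop()

    def empty(self):       # True if no elements exist
        return len(self._items) == 0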

Pros and Cons

  • Linked list: an additional pointer has to be stored for every pushed element
  • Array: space can be wasted if we allocate a much larger array than we need

Example

  • Balanced Brackets
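
Balanced brackets is a classic use of a stack: push every opening bracket and pop on every closing one. A short sketch (the function name is just illustrative):

def is_balanced(text):
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in text:
        if ch in '([{':
            stack.append(ch)                      # push opening brackets
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                      # unmatched closing bracket
    return not stack                              # balanced only if nothing is left over

print(is_balanced("(a[b]{c})"))  # True
print(is_balanced("(]"))         # False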

Array

An array is a contiguous area of memory consisting of equal-size elements indexed by contiguous integers.

  • Constant-time access to any element.
  • Constant time to add/remove at the end.
  • Linear time to add/remove at any arbitrary location.
            Add   Remove
Beginning   O(n)  O(n)
End         O(1)  O(1)
Middle      O(n)  O(n)

There are different types of arrays.

1. Static Array

  • The size of the array is fixed and cannot be changed.

2. Dynamically-allocated Array

  • The maximum size of the array has to be declared when the array is allocated.

3. Dynamic / Resizable Array

  • Unlike static arrays, it can be resized.
  • Appending a new element is O(1) amortized, but a single append can take O(n) when the underlying array has to be resized.
  • Some space is wasted, since the underlying array is usually larger than the number of stored elements.

It has the following operations:

  • Get(i): returns element at location i
  • Set(i, val): sets element i to val
  • PushBack(val): adds val to the end
  • Remove(i): removes element at location i
  • Size(): returns the number of elements

Common Implementation

  • C++: vector
  • Java: ArrayList
  • Python: list (the only kind of array)

Note that there are no static arrays in Python. They are all dynamic arrays.

Operation       Runtime
Get(i)          O(1)
Set(i, val)     O(1)
PushBack(val)   O(n) worst case, O(1) amortized
Remove(i)       O(n)
Size()          O(1)
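
To make the PushBack cost concrete, here is a small sketch of a dynamic array built on top of a fixed-capacity buffer that doubles when it fills up (the names and the initial capacity are illustrative):

class DynamicArray:
    def __init__(self):
        self._capacity = 2                 # illustrative initial capacity
        self._size = 0
        self._buffer = [None] * self._capacity

    def get(self, i):                      # Get(i): O(1)
        return self._buffer[i]

    def set(self, i, val):                 # Set(i, val): O(1)
        self._buffer[i] = val

    def push_back(self, val):              # PushBack(val): O(1) amortized
        if self._size == self._capacity:
            # resize: copy everything into a buffer twice as large, O(n)
            self._capacity *= 2
            new_buffer = [None] * self._capacity
            for i in range(self._size):
                new_buffer[i] = self._buffer[i]
            self._buffer = new_buffer
        self._buffer[self._size] = val
        self._size += 1

    def remove(self, i):                   # Remove(i): O(n), shifts the tail left
        for j in range(i, self._size - 1):
            self._buffer[j] = self._buffer[j + 1]
        self._size -= 1

    def size(self):                        # Size(): O(1)
        return self._size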

SSH, or secure shell, is a secure protocol and the most common way of safely administering remote servers. It’s important to understand how SSH is used for authentication with Git and similar tools.

I have multiple accounts on GitHub and Bitbucket, one for work and one for personal use, so I had to set up multiple SSH keys on a single machine.

Generate an SSH key pair

$ ssh-keygen -t rsa -C "public.hodoo@gmail.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/davidlee/.ssh/id_rsa): /Users/davidlee/.ssh/id_rsa_github
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/davidlee/.ssh/id_rsa_github.
Your public key has been saved in /Users/davidlee/.ssh/id_rsa_github.pub.

Add the new SSH key to GitHub

# Mac: copy the public key to the clipboard, then paste it into your GitHub SSH key settings
$ pbcopy < ~/.ssh/id_rsa_github.pub

Set up a config file

# ~/.ssh/config
Host hodoogithub
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_rsa_github
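
Since the whole point is handling multiple accounts, the same file can hold one Host entry per key. As an illustration, a second entry for a personal Bitbucket key might look like this (the alias and key file name are hypothetical):

# ~/.ssh/config (continued)
# hypothetical second entry for a personal Bitbucket key
Host hodoobitbucket
  HostName bitbucket.org
  User git
  IdentityFile ~/.ssh/id_rsa_bitbucket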

Let’s connect!

$ ssh -T git@hodoogithub
Hi hodoolee! You've successfully authenticated, but GitHub does not provide shell access.

Set the remote URL

$ git remote set-url origin git@hodoogithub:hoodoolee/project.git