PySpark vs Pandas

Spark and Pandas DataFrames are very similar.
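
The snippets below assume pandas is imported as pd and a SparkSession is available as spark; a minimal sketch of that setup (the app name is arbitrary):

import pandas as pd
from pyspark.sql import SparkSession

# assumed setup for the snippets below
spark = SparkSession.builder.appName("pyspark-vs-pandas").getOrCreate()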

# Pandas

# load data
df = pd.read_csv("mtcars.csv")

# view dataframe
df 
df.head(10)

# columns and data types
df.columns
df.dtypes

# rename columns
df.columns = ['a', 'b', 'c']
df.rename(columns = {'old': 'new'})

# drop column
df.drop('mpg', axis=1)

# filtering
df[df.mpg < 20]
df[(df.mpg < 20) & (df.cyl == 6)]

# add column
df['gpm'] = 1 / df.mpg # division by 0 gives inf

# fill nulls
df.fillna(0) # more options than PySpark has

# aggregation
df.groupby(['cyl', 'gear']) \
  .agg({'mpg':'mean', 'disp':'min'})

# PySpark

# load data
df = spark.read \
  .options(header=True, inferSchema=True) \
  .csv("mtcars.csv")

# view dataframe
df.show() # evaluating df alone only shows the schema; show() prints rows
df.show(10)

# columns and data types
df.columns
df.dtypes

# rename columns
df.toDF('a', 'b', 'c')
df.withColumnRenamed('old', 'new')

# drop column
df.drop('mpg')

# filtering
df[df.mpg < 20]
df[(df.mpg < 20) & (df.cyl == 6)]

# add column
df.withColumn('gpm', 1 / df.mpg) # division by 0 gives null

# fill nulls
df.fillna(0)

# aggregation
df.groupby(['cyl', 'gear']) \
  .agg({'mpg':'mean', 'disp':'min'})

Okay, we get the point. Now let's look at the places where the two APIs differ a bit more.

# Pandas

# STANDARD TRANSFORMATIONS
# uses python numpy lib
import numpy as np
df['logdisp'] = np.log(df.disp)

# ROW CONDITIONAL STATEMENTS
df['cond'] = df.apply(lambda r:
  1 if r.mpg > 20 else 2 if r.cyl == 6 else 3,
  axis = 1)

# PYTHON WHEN REQUIRED
df['disp1'] = df.disp.apply(lambda x: x+1)

# MERGE/JOIN DATAFRAMES
left.merge(right, on='key')
left.merge(right, left_on='a', right_on='b')

# PIVOT TABLE
pd.pivot_table(df, values='D', \
  index=['A', 'B'], columns=['C'], \
  aggfunc=np.sum)

# SUMMARY STATISTICS
df.describe()

# HISTOGRAM
df.hist()

# SQL
n/a

# PySpark

# STANDARD TRANSFORMATIONS
# uses built-in functions 
import pyspark.sql.functions as F
df.withColumn('logdisp', F.log(df.disp))

# ROW CONDITIONAL STATEMENTS
df.withColumn('cond', \
  F.when(df.mpg > 20, 1) \
   .when(df.cyl == 6, 2) \
   .otherwise(3))
   
# PYTHON WHEN REQUIRED
from pyspark.sql.types import DoubleType
fn = F.udf(lambda x: x+1, DoubleType())
df.withColumn('disp1', fn(df.disp))

# MERGE/JOIN DATAFRAMES
left.join(right, on='key')
left.join(right, left.a == right.b)

# PIVOT TABLE
df.groupBy("A", "B").pivot("C").sum("D")

# SUMMARY STATISTICS
df.describe().show() # only count, mean, stddev, min, max

# HISTOGRAM
df.sample(False, 0.1).toPandas().hist()

# SQL
df.createOrReplaceTempView('foo')
df2 = spark.sql('select * from foo')

Queue

  • Queues can be implemented with either an array or a linked list (with a tail pointer)
  • Each queue operation takes O(1): Enqueue, Dequeue, Empty
  • FIFO (First In First Out)

Operations

  • Enqueue(Key): adds key to collection
  • Key Dequeue(): removes and returns least recently added key
  • Boolean Empty(): returns whether the queue is empty
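
A minimal sketch of these queue operations in Python, using collections.deque as the backing structure (the class and method names simply mirror the list above):

from collections import deque

class Queue:
    def __init__(self):
        self._items = deque()

    def enqueue(self, key):   # adds key to the collection
        self._items.append(key)

    def dequeue(self):        # removes and returns the least recently added key
        return self._items.popleft()

    def empty(self):          # True if no elements exist
        return len(self._items) == 0

q = Queue()
q.enqueue('a'); q.enqueue('b')
print(q.dequeue())  # 'a' (FIFO)
print(q.empty())    # False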

Stack

  • Stacks can be implemented with either an array or a linked list
  • Each stack operation takes O(1): Push, Top, Pop, Empty
  • LIFO (Last In First Out)

Operations

  • Push(Key): adds key to collection
  • Key Top(): returns most recently added key
  • Key Pop(): removes and returns most recently added key
  • Boolean Empty(): returns whether the stack is empty
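
A corresponding sketch of the stack operations in Python, backed by a plain list (again, the names mirror the list above):

class Stack:
    def __init__(self):
        self._items = []

    def push(self, key):   # adds key to the collection
        self._items.append(key)

    def top(self):         # returns the most recently added key
        return self._items[-1]

    def pop(self):         # removes and returns the most recently added key
        return self._items.pop()

    def empty(self):       # True if no elements exist
        return len(self._items) == 0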

Pros and Cons

  • Linked list: an additional pointer has to be stored for every pushed element
  • Array: space can be wasted if we allocate a much larger array than we need

Example

  • Balanced Brackets
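
Balanced brackets is a classic use of a stack: push every opening bracket and pop on every closing one. A short sketch (the function name is just illustrative):

def is_balanced(text):
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in text:
        if ch in '([{':
            stack.append(ch)                      # push opening brackets
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                      # unmatched closing bracket
    return not stack                              # balanced only if nothing is left over

print(is_balanced("(a[b]{c})"))  # True
print(is_balanced("(]"))         # False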

Array

An array is a contiguous area of memory consisting of equal-size elements indexed by contiguous integers.

  • Constant-time access to any element.
  • Constant time to add/remove at the end.
  • Linear time to add/remove at any arbitrary location.
            Add   Remove
Beginning   O(n)  O(n)
End         O(1)  O(1)
Middle      O(n)  O(n)

There are different types of arrays.

1. Static Array

  • The size of the array is fixed and cannot be changed.

2. Dynamically-allocated Array

  • The maximum size of the array has to be declared when the array is allocated.

3. Dynamic / Resizable Array

  • Unlike static arrays, it can be resized.
  • Appending a new element is O(1) amortized, but a single append can take O(n) when the underlying array has to be resized.
  • Some space is wasted, since the underlying array is usually larger than the number of stored elements.

It has the following operations:

  • Get(i): returns element at location i
  • Set(i, val): sets element i to val
  • PushBack(val): adds val to the end
  • Remove(i): removes element at location i
  • Size(): returns the number of elements

Common Implementation

  • C++: vector
  • Java: ArrayList
  • Python: list (the only kind of array)

Note that there are no static arrays in Python. They are all dynamic arrays.

Operation       Runtime
Get(i)          O(1)
Set(i, val)     O(1)
PushBack(val)   O(n) worst case, O(1) amortized
Remove(i)       O(n)
Size()          O(1)
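
To make the PushBack cost concrete, here is a small sketch of a dynamic array built on top of a fixed-capacity buffer that doubles when it fills up (the names and the initial capacity are illustrative):

class DynamicArray:
    def __init__(self):
        self._capacity = 2                 # illustrative initial capacity
        self._size = 0
        self._buffer = [None] * self._capacity

    def get(self, i):                      # Get(i): O(1)
        return self._buffer[i]

    def set(self, i, val):                 # Set(i, val): O(1)
        self._buffer[i] = val

    def push_back(self, val):              # PushBack(val): O(1) amortized
        if self._size == self._capacity:
            # resize: copy everything into a buffer twice as large, O(n)
            self._capacity *= 2
            new_buffer = [None] * self._capacity
            for i in range(self._size):
                new_buffer[i] = self._buffer[i]
            self._buffer = new_buffer
        self._buffer[self._size] = val
        self._size += 1

    def remove(self, i):                   # Remove(i): O(n), shifts the tail left
        for j in range(i, self._size - 1):
            self._buffer[j] = self._buffer[j + 1]
        self._size -= 1

    def size(self):                        # Size(): O(1)
        return self._size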

SSH, or secure shell, is a secure protocol and the most common way of safely administering remote servers. It’s important to understand how SSH is used for authentication with Git and similar tools.

I have multiple accounts on GitHub and Bitbucket, one for work and one for personal use, so I had to set up multiple SSH keys on a single machine.

Generate an SSH key pair

$ ssh-keygen -t rsa -C "public.hodoo@gmail.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/davidlee/.ssh/id_rsa): /Users/davidlee/.ssh/id_rsa_github
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/davidlee/.ssh/id_rsa_github.
Your public key has been saved in /Users/davidlee/.ssh/id_rsa_github.pub.

Add the new SSH key to GitHub

# Mac: copy the public key to the clipboard, then paste it into your GitHub SSH key settings
$ pbcopy < ~/.ssh/id_rsa_github.pub

Set up a config file

# ~/.ssh/config
Host hodoogithub
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_rsa_github
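
Since the whole point is handling multiple accounts, the same file can hold one Host entry per key. As an illustration, a second entry for a personal Bitbucket key might look like this (the alias and key file name are hypothetical):

# ~/.ssh/config (continued)
# hypothetical second entry for a personal Bitbucket key
Host hodoobitbucket
  HostName bitbucket.org
  User git
  IdentityFile ~/.ssh/id_rsa_bitbucket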

Let’s connect!

$ ssh -T git@hodoogithub
Hi hodoolee! You've successfully authenticated, but GitHub does not provide shell access.

Set the remote URL

$ git remote set-url origin git@hodoogithub:hoodoolee/project.git