Data Collection | Data Cleaning | Exploratory Data Analysis (EDA) | Model Building | Model Deployment
Extract Load Transform (ELT) | dbt (Data Build Tool) | Orchestration | Modules
Data Science Tutorial by Example
ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions.
Read a (local) CSV File | Download from API | Download CSV from API | Write / Read Parquet File | Save to a DataFrame | Merge Data | Concatenate Data | GroupBy Data | Unique Values | Replace NaN Values |
Reading a large uncompressed comma-separated value (CSV) file is one of the least efficient ways to access big data. Whenever possible, the file should be either imported into a local database or saved to a compressed binary format such as TDMS or HDF5 for subsequent reading.
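For example, a minimal sketch of the compressed binary approach using pandas' HDF5 support (the file name and key below are arbitrary, and the optional PyTables package is assumed to be installed):
# pip install tables  (PyTables, the HDF5 backend used by pandas)
import pandas as pd
df = pd.DataFrame({'a': range(1000), 'b': [x * 0.5 for x in range(1000)]})
# Write the DataFrame to a compressed HDF5 file
df.to_hdf('example.h5', key='df', mode='w', complevel=9, complib='zlib')
# Read it back later; this is much faster than re-parsing a large uncompressed CSV
df = pd.read_hdf('example.h5', key='df')
print(df.head())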
import pandas as pd
# Load the Titanic dataset into a pandas dataframe (df) from:
# https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/data/titanic.csv
#df = pd.read_csv("https://raw.githubusercontent.com/jorisvandenbossche/pandas-tutorial/master/data/titanic.csv")
# OR
df = pd.read_csv('data/titanic.csv') # Download the CSV file to a subfolder 'data' under the folder where this script resides.
If a CSV file contains the delimiter character inside quoted string/text columns, this limits your ability to use fast methods (such as psycopg2's copy_from()) to import the CSV data into a PostgreSQL database. The Pandas function pd.read_csv() is a more flexible option for these cases.
Reading a local CSV file with a poor structure:
#
# Written by: Mark W Kiehl
# http://mechatronicsolutionsllc.com/
# http://www.savvysolutions.info/savvycodesolutions/
#
# Create a PostgreSQL database table from a CSV file by deriving its structure using Pandas,
# and then write the CSV contents to the database table (without using Pandas).
# Thereafter the CSV contents are available in the database table for extraction and analysis.
#
# This script assumes the Titanic CSV file is already downloaded locally.
# Titanic dataset https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/data/titanic.csv
# THE TITANIC CSV FILE IS NOT FORMATTED WELL. It contains the delimiter ',' within
# the text/string columns, limiting the ability to use fast options such as
# cur.copy_from() to read the CSV file quickly. This script demonstrates a solution
# to that CSV read problem.
# Python performance timer
import time
t_start_sec = time.perf_counter()
# pip install memory-profiler
# Use the package memory-profiler to measure memory consumption
# The peak memory is the difference between the starting value of the “Mem usage” column, and the highest value (also known as the “high watermark”).
# IMPORTANT: See how / where @profile is inserted before a function later in the script.
from memory_profiler import profile
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro
import csv
# pip install psycopg2
import psycopg2
import psycopg2.extras # for execute_batch
conn = psycopg2.connect(
host="localhost",
database="pytestdb",
user="postgres",
password="your-db-password",
)
conn.autocommit = True #set autocommit=True so every command we execute will take effect immediately
def pandas_types_to_postgresql_cols(dtypes):
# Translate Pandas data type in dtypes and return a string for a SQL create table statement, e.g. CREATE UNLOGGED TABLE <table name> (col1 INTEGER, col2 TEXT, ..)
# NOTE: The string returned may need modification!
#
# The following types (or spellings thereof) are specified by SQL: bigint, bit, bit varying, boolean, char, character varying, character, varchar, date, double precision, integer, interval, numeric, decimal, real, smallint, time (with or without time zone), timestamp (with or without time zone), xml.
# Pandas data types: string_, unicode_, mixed types
# int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64
# float_, float16, float32, float64
# bool_
# datetime64[ns]
s = '('
for key in dtypes:
#print(key,': ',dtypes[key])
if dtypes[key] == 'int' or dtypes[key] == 'int8' or dtypes[key] == 'int16' or dtypes[key] == 'int32' or dtypes[key] == 'int64' or dtypes[key] == 'uint8' or dtypes[key] == 'uint16' or dtypes[key] == 'uint32' or dtypes[key] == 'uint64':
s = s + key + ' ' + 'INTEGER,'
elif dtypes[key] == 'float' or dtypes[key] == 'float16' or dtypes[key] == 'float32' or dtypes[key] == 'float64':
s = s + key + ' ' + 'DECIMAL,'
elif dtypes[key] == 'bool':
s = s + key + ' ' + 'BOOLEAN,'
elif dtypes[key] == 'datetime64[ns]':
s = s + key + ' ' + 'TIMESTAMPTZ,' #TIMESTAMP, TIMESTAMPTZ, DATE, ..
elif dtypes[key] == 'string' or dtypes[key] == 'unicode':
s = s + key + ' ' + 'TEXT,'
else:
print('WARNING: unexpected dtypes of "{}" in pandas_types_to_postgresql_cols() for {}'.format(dtypes[key], key))
s = s + key + ':' + 'ERROR'
s = s[0:-1] + ')'
return s
def create_db_table(cur, table_name, s):
""" Connect to the local PostgreSQL database server """
try:
# connect to the PostgreSQL server
print('Connecting to the PostgreSQL database...')
# create a cursor
cur = conn.cursor()
# execute a SQL statement to get the PostgreSQL version
cur.execute('SELECT version()')
# display the PostgreSQL database server version
db_version = cur.fetchone()
print(db_version)
# Build the SQL statement to execute
sql = 'DROP TABLE IF EXISTS ' + table_name + ';'
sql = sql + 'CREATE UNLOGGED TABLE ' + table_name + ' ' + s + ';'
sql = "" + sql + ""
print(sql)
# Create a new table named table_name
cur.execute(sql)
# execute a SQL statement to get the number of records in table table_name
# (throws an error if the table doesn't exist
cur.execute('select count(*) from ' + table_name + ';')
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
print('Created db table ' + table_name)
def write_csv_to_db_table(cur, table_name, csv_filepath):
""" Connect to the local PostgreSQL database server """
try:
# connect to the PostgreSQL server
print('Connecting to the PostgreSQL database...')
# create a cursor
cur = conn.cursor()
# The CSV file is not formatted well. It contains the delimiter ',' within
# the text/string columns, limiting the ability to use fast options such as
# cur.copy_from() to read the CSV file quickly.
# Read some of the CSV data from the file to get the column information
df = pd.read_csv(csv_filepath, nrows=5)
# Create a comma separated list of the DataFrame column names
cols = ','.join(list(df.columns))
#print(cols) #PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
#Build a string with the '(%s, ... %s)' for VALUES
v = '('
for x in range(len(df.columns)):
v = v + '%s,'
v = v[0:-1]
v = v + ')'
#print(v) #(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
# The arguments to csv.reader() cause any delimiter between the quotechar to be ignored.
# This is not the most efficient / fast way to read a CSV file, but it is necessary given
# the presence of the delimiter character in the strings/text.
with open(csv_filepath, 'r') as f:
csv_reader = csv.reader(f, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)
next(csv_reader) # Skip the header row.
for row in csv_reader:
# Note: row is a list
# Below is the recommended way to call INSERT using psycopg2:
cur.execute("INSERT INTO " + table_name + " VALUES " + v, row)
except (Exception, psycopg2.DatabaseError) as error:
print('ERROR: {}'.format(error))
cur.close()
finally:
print('done .. write_csv_to_db_table()')
def alter_db_table(cur, table_name, sql):
""" Connect to the local PostgreSQL database server """
try:
# connect to the PostgreSQL server
print('Connecting to the PostgreSQL database...')
# create a cursor
cur = conn.cursor()
# execute a SQL statement to get the PostgreSQL version
cur.execute('SELECT version()')
# display the PostgreSQL database server version
db_version = cur.fetchone()
print(db_version)
# Alter the table named table_name with the statements as specified within string SQL
cur.execute(sql)
# execute a SQL statement to get the number of records in table table_name
# (throws an error if the table doesn't exist
cur.execute('select count(*) from ' + table_name + ';')
print('Altered {} rows in db table {}'.format(cur.fetchone(),table_name))
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
print('done .. alter_db_table()')
def get_db_table_struct(cur, table_name):
""" Connect to the local PostgreSQL database server """
try:
# create a cursor
cur = conn.cursor()
# Query the information_schema.columns catalog to get the structure of table_name
sql = 'SELECT table_name, column_name, data_type FROM information_schema.columns WHERE table_name = \'' + table_name + '\';'
sql = "" + sql + ""
cur.execute(sql)
tbl_struct = cur.fetchall()
return tbl_struct
except (Exception, psycopg2.DatabaseError) as error:
print(error)
@profile # instantiating the decorator for the function/code to be monitored for memory usage by memory-profiler
def main():
# Get the CSV structure information so it can be used to create a database table.
"""
# Pandas data types: string_, unicode_, mixed types
# int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64
# float_, float16, float32, float64
# bool_
# datetime64[ns]
"""
"""
Contents of file 'titanic.csv':
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""
# Load the CSV file 'titanic.csv' into a PostgreSQL database table.
# The best way to get the CSV file structure is to read the first n rows of the CSV file with
# Pandas, and then use what Pandas acquired from the file to create the database table structure
csv_filepath = 'data/titanic.csv'
# IMPORTANT: If the data contains dates, they may import as data type 'Object' rather than 'datetime64[ns]'.
# To fix this, add the "parse_dates=['column_name']" option and update 'column_name' with the name of the column with dates.
# (The Titanic dataset has no date columns, so that option is not needed here.)
df = pd.read_csv(csv_filepath, nrows=50)
#print(df.info())
"""
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 50 non-null int64
1 Survived 50 non-null int64
2 Pclass 50 non-null int64
3 Name 50 non-null object
4 Sex 50 non-null object
5 Age 36 non-null float64
6 SibSp 50 non-null int64
7 Parch 50 non-null int64
8 Ticket 50 non-null object
9 Fare 50 non-null float64
10 Cabin 9 non-null object
11 Embarked 50 non-null object
"""
# How to manually convert a DataFrame column to data type 'datetime64[ns]': df['column_name'] = pd.to_datetime(df['column_name'], unit='s')
# Manually convert the text columns to data type 'string'
df['Name'] = df['Name'].astype('string')
df['Sex'] = df['Sex'].astype('string')
df['Ticket'] = df['Ticket'].astype('string')
df['Cabin'] = df['Cabin'].astype('string')
df['Embarked'] = df['Embarked'].astype('string')
df['Age'] = df['Age'].astype('string') # actually a float64, but NaN values exist
#print(df.info())
#print('\n')
"""
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 50 non-null int64
1 Survived 50 non-null int64
2 Pclass 50 non-null int64
3 Name 50 non-null string
4 Sex 50 non-null string
5 Age 36 non-null float64
6 SibSp 50 non-null int64
7 Parch 50 non-null int64
8 Ticket 50 non-null string
9 Fare 50 non-null float64
10 Cabin 9 non-null string
11 Embarked 50 non-null string
"""
# Get the data types from the DataFrame and convert them from a Series to a dictionary
dtypes = df.dtypes.to_dict()
# Get a string with SQL statements for the column names and data types derived from dtypes
s = pandas_types_to_postgresql_cols(dtypes)
print(s)
#(PassengerId INTEGER,Survived INTEGER,Pclass INTEGER,Name TEXT,Sex TEXT,Age DECIMAL,SibSp INTEGER,Parch INTEGER,Ticket TEXT,Fare DECIMAL,Cabin TEXT,Embarked TEXT)
# Edit the string s if necessary before it is used to create the database table.
#s = '(PassengerId INTEGER,Survived INTEGER,Pclass INTEGER,Name TEXT,Sex TEXT,Age DECIMAL,SibSp INTEGER,Parch INTEGER,Ticket TEXT,Fare DECIMAL,Cabin TEXT,Embarked TEXT)'
print('\n')
table_name = 'titanic'
with conn.cursor() as cur:
create_db_table(cur, table_name, s)
print('\n')
# Read the CSV into the db table.
# NOTE: you will need to modify write_csv_to_db_table() for your specific CSV.
write_csv_to_db_table(cur, table_name, csv_filepath)
# Example: alter the db table if needed, e.g. to fix a column's data type.
# (The ALTER statement below is illustrative only; it adds a new TIMESTAMPTZ column.)
s = 'ALTER TABLE ' + table_name + ' ADD COLUMN unix_s_ts TIMESTAMPTZ;'
#alter_db_table(cur, table_name, s)
# Print out the structure of table table_name
print('\nStructure of db table ' + table_name + ':')
tbl_struct = get_db_table_struct(cur, table_name);
for line in tbl_struct:
print(line)
"""
('test_a', 'datetime', 'timestamp with time zone')
('test_a', 'int', 'integer')
('test_a', 'float', 'numeric')
('test_a', 'bool1', 'boolean')
('test_a', 'unix_s', 'timestamp with time zone')
('test_a', 'text', 'text')
"""
cur.close()
conn.close()
# Report the script execution time
t_stop_sec = time.perf_counter()
print('\nElapsed time {:6f} sec'.format(t_stop_sec-t_start_sec))
if __name__ == '__main__':
main() #only executes when NOT imported from another script
Pandas makes it very easy to download a CSV file directly from an online source.
# Python performance timer
import time
t_start_sec = time.perf_counter()
# pip install memory-profiler
# Use the package memory-profiler to measure memory consumption
# The peak memory is the difference between the starting value of the “Mem usage” column, and the highest value (also known as the “high watermark”).
# IMPORTANT: See how / where @profile is inserted before a function later in the script.
from memory_profiler import profile
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro
import json
def data_collection():
# Load the Titanic dataset into a pandas dataframe (df) from:
# https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/data/titanic.csv
df = pd.read_csv("https://raw.githubusercontent.com/jorisvandenbossche/pandas-tutorial/master/data/titanic.csv")
# OR
#df = pd.read_csv('data/titanic.csv') # Download the CSV file to a subfolder 'data' under the folder where this script resides.
return df
def show_df_contents(df):
# print general information about the DataFrame contents
print(df.info())
print('\n')
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
"""
#@profile # instantiating the decorator for the function/code to be monitored for memory usage by memory-profiler
def main():
# Load the Titanic dataset
df = data_collection()
# Show some basic information about the CSV file contents in the DataFrame df
show_df_contents(df)
# Report the script execution time
t_stop_sec = time.perf_counter()
print('\nElapsed time {:6f} sec'.format(t_stop_sec-t_start_sec))
if __name__ == '__main__':
main() #only executes when NOT imported from another script
# pip install pandas
import pandas as pd
import numpy as np
# pip install pyarrow
import pyarrow
from pathlib import Path
# Create dataframe
df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,np.nan,float('NaN'),None],[4,'D',4.45,7]], columns = ['idx','cat','float','int'])
# Write the Parquet file
path_file = Path(Path.cwd()).joinpath('my_dataframe.parquet')
df.to_parquet(path=path_file, compression='gzip')
print(df.head())
del df
# Read the Parquet file to a dataframe
if not path_file.exists(): raise Exception("File doesn't exist: ", path_file)
df = pd.read_parquet(path=path_file)
print(df.head())
import pandas as pd
# list to DataFrame
li = [[1,2,3],[4,5,6]]
df = pd.DataFrame(li, columns=['A','B','C'])
A B C
0 1 2 3
1 4 5 6
# dictionary of lists to DataFrame
dic = {'A':[1,2,3],'B':[4,5,6]}
df = pd.DataFrame(dic)
A B
0 1 4
1 2 5
2 3 6
import pandas as pd
# merge two DataFrames (same as SQL JOIN)
df1 = pd.DataFrame([[1,'A'],[2,'B']], columns=['col1','col2'])
df2 = pd.DataFrame([['A',3],['B',4]], columns=['col2','col3'])
df = pd.merge(df1, df2, on = 'col2', how = 'inner')
col1 col2 col3
0 1 A 3
1 2 B 4
import pandas as pd
# Concatenate two DataFrames. axis=1 stacks columns side by side; axis=0 stacks rows, provided the column headers match.
df1 = pd.DataFrame([[1,'A'],[2,'B']], columns=['col1','col2'])
df2 = pd.DataFrame([['A',3],['B',4]], columns=['col2','col3'])
df = pd.concat((df1, df2), axis = 1)
col1 col2 col2 col3
0 1 A A 3
1 2 B B 4
import pandas as pd
df = pd.DataFrame([[1,'A'],[2,'B'],[3,'A'],[4,'C']], columns = ['col1','col2'])
print(df.head())
col1 col2
0 1 A
1 2 B
2 3 A
3 4 C
print(df.groupby('col2').col1.sum())
col2
A 4
B 2
C 4
import pandas as pd
df = pd.DataFrame([[1,'A'],[2,'B'],[3,'A'],[4,'C']], columns = ['col1','col2'])
print(df.head())
col1 col2
0 1 A
1 2 B
2 3 A
3 4 C
# Get a list of unique values in DataFrame column 'col2'
print(df['col2'].unique())
['A' 'B' 'C']
# Get the count of unique values in DataFrame column 'col2'
print(df['col2'].value_counts())
A 2
B 1
C 1
Pandas is built on top of NumPy. If a NumPy array contains NaNs, NumPy's aggregate functions (np.mean, np.min, np.max, etc.) return NaN. NumPy also provides NaN-ignoring counterparts, such as np.nansum, np.nanmean, np.nanmin, etc.
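A minimal illustration of that behavior:
import numpy as np
a = np.array([1.0, 2.0, np.nan, 4.0])
print(np.mean(a))            # nan (the NaN propagates)
print(np.nanmean(a))         # 2.333... (NaN values are ignored)
print(np.nansum(a))          # 7.0
print(np.nanmin(a), np.nanmax(a))  # 1.0 4.0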
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,float('NaN')],[2,'B'],[3,np.nan],[4,'C']], columns = ['col1','col2'])
print(df.head())
col1 col2
0 1 NaN
1 2 B
2 3 NaN
3 4 C
df['col2'] = df['col2'].fillna('B')
col1 col2
0 1 B
1 2 B
2 3 B
3 4 C
20% of Pandas Functions that Data Scientists Use 80% of the Time
Data Validation should be configured prior to building a data model. How to Do Data Validation on Your Data on Pandas with pytest
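As a minimal sketch of that idea (the file path and column names below are only an assumption based on the Titanic dataset used elsewhere in this tutorial), a pytest test module that validates the data before any modeling is done:
# test_titanic_data.py -- run with: pytest test_titanic_data.py
import pandas as pd

def load_data():
    # Hypothetical loader; replace with your own data source
    return pd.read_csv('data/titanic.csv')

def test_required_columns_present():
    df = load_data()
    required = {'PassengerId', 'Survived', 'Pclass', 'Age', 'Fare'}
    assert required.issubset(df.columns)

def test_value_ranges():
    df = load_data()
    assert df['Survived'].isin([0, 1]).all()
    assert (df['Fare'].dropna() >= 0).all()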
Missing Values | Timestamp | Encoding Categorical Fields | Standardizing Numeric Columns | Binning Numeric Columns | Min/Max Scaling | Robust Scale | Power Transformations | Quantile Transformations | Box-Cox Transformations |
There are various ways to handle missing values, such as filling them with the mean, median, or mode of the column, or dropping rows with missing values.
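For example, a small sketch of those basic options (the detailed walk-through continues below):
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0]})
df['a_mean'] = df['a'].fillna(df['a'].mean())          # fill with the mean
df['a_median'] = df['a'].fillna(df['a'].median())      # fill with the median
df['a_mode'] = df['a'].fillna(df['a'].mode().iat[0])   # fill with the mode
df_dropped = df.dropna(subset=['a'])                   # or drop the rows with missing values
print(df, '\n', df_dropped)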
df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,np.nan,float('NaN'),None],[4,'D',4.45,7]], columns = ['idx','cat','float','int'])
"""
print(df.head(), '\n')
idx cat float int
0 1 A 1.23 10.0
1 2 B 2.34 9.0
2 3 NaN NaN NaN
3 4 D 4.45 7.0
"""
# Get the count of missing values by column
"""
print('DataFrames with missing values:\n',df.isnull().sum(),'\n')
DataFrames with missing values:
idx 0
cat 1
float 1
int 1
"""
# Calculate the count and % of missing values
"""
total = df.isnull().sum().sort_values(ascending=False)
percent_1 = df.isnull().sum()/df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Count', '%'])
print('DataFrames with missing values:\n',missing_data.head(5))
DataFrames with missing values:
Count %
cat 1 25.0
float 1 25.0
int 1 25.0
idx 0 0.0
"""
# Replace the missing int column values with random values generated based
# on the mean and the standard deviation of the int column.
# (The column must be of data type float and normally distributed)
"""
mean = df['int'].mean()
#print('int column mean = ', mean) # int column mean = 8.666666666666666
std = df['int'].std()
#print('int std = ', std) # int std = 1.5275252316519468
is_null = df['int'].isnull().sum()
#print(is_null, 'int values are null') # 1 int values are null
print('Replacing column int nan values with ', is_null, ' random values between ', mean - std, ' and ', mean + std, ' based on the mean and std of column int')
# np.random.randint(low, high=None, size=None, dtype=int)
rand_int = np.random.randint(mean - std, mean + std, size = is_null) # rand_int is a numpy array with one value per missing entry
#print('rand_int = ', pd.DataFrame(rand_int).to_string(header=False, index=False)) # 9
# Replace NaN values in int column with the random values generated
int_copy = df['int'].copy()
int_copy[np.isnan(int_copy)] = rand_int # np.isnan() tests element-wise whether it is NaN or not and returns the result as a boolean array
df['int'] = int_copy
#print(df['int'].isnull().sum(), 'int values are null') # 0 int values are null
print(df.head(), '\n')
idx cat float int
0 1 A 1.23 10.0
1 2 B 2.34 9.0
2 3 NaN NaN 9.0
3 4 D 4.45 7.0
"""
# Replace NaN value for column 'int' with value of 8
#df['int'].fillna(8, inplace=True)
"""
print(df.head())
print('\n')
idx cat float int
0 1 A 1.23 10.0
1 2 B 2.34 9.0
2 3 NaN NaN 8.0
3 4 D 4.45 7.0
"""
# Replace NaN value for column 'cat' with value of A
# (WARNING: Do not use: df['cat'] = df['cat'].fillna('A', inplace=True))
#df['cat'].fillna('A', inplace=True)
# Replace the NaN value for column 'float' with an interpolated value
# DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
#df['float'] = df['float'].interpolate(method ='linear', limit_direction ='forward')
"""
idx cat float int
0 1 A 1.230 10
1 2 B 2.340 9
2 3 A 3.395 8
3 4 D 4.450 7
"""
# Replace NaN values for column 'float' with the median
#df['float'].fillna(df['float'].median(), inplace=True)
# Replace NaN values for column 'float' with the mode
"""
#print(df['float'].isnull().sum(), ' missing values in column float') # 1 missing values in column float
mode = df['float'].mode()
mode = mode.iat[0] # Convert the single-row mode Series to a scalar value
# OR: mode = df['float'].mode().iat[0]
#print('float column mode = ', mode) # 1.23
print('Replacing ', df['float'].isnull().sum(), ' missing values in column float with the mode of ', mode)
# Replacing 1 missing values in column float with the mode of 1.23
df['float'].fillna(mode, inplace=True)
print(df['float'].isnull().sum(), ' missing values in column float') # 0 missing values in column float
"""
# Create a new column from the values in column 'float' using the Pandas .apply() function with a custom function df_prod().
"""
def df_prod(val):
return val * 3.0
df['float3'] = df['float'].apply(df_prod)
print(df.head())
print('\n')
# Delete the column 'float3'
df.drop(columns=['float3'], inplace=True)
"""
# Fix a datetime index missing dates
"""
# NOTE: Gaps exist in the index (days missing).
# Close/Last Volume Open High Low
#Date
#2019-01-02 12.55 3210083 12.25 12.6300 12.1372
#2019-01-03 12.39 2534732 12.42 12.5200 12.3525
#2019-01-04 12.84 2743466 12.67 12.9100 12.6150
#2019-01-07 12.97 3163269 12.84 13.0500 12.7750
#2019-01-08 13.00 2384501 13.04 13.0900 12.9200
#... ... ... ... ... ...
#2022-12-23 18.62 1607966 18.52 18.6650 18.5000
#2022-12-27 18.64 1001465 18.63 18.7400 18.5850
#2022-12-28 18.54 1226361 18.71 18.7699 18.5400
#2022-12-29 18.80 1217531 18.78 18.8300 18.7400
#2022-12-30 18.67 1592404 18.71 18.8000 18.5500
# Create an index with the date range of interest
idx = pd.date_range('01-01-2019', '12-31-2022', freq="D")
# Update the dataframe with the new index, and backfill the data columns with any missing data
df = df.reindex(idx, method='bfill')
# Extract a subset to get full yearly data from 2019 to 2022
df = df.loc['2019-01-01':'2022-12-31']
"""
# pip install pandas
import pandas as pd
# Create dataframe
df = pd.DataFrame([[1,'2004-03-10 18:00:00',1.23,10],
[2,'2004-03-10 19:00:00',3.38,9],
[3,'2004-03-10 18:00:00',2.34,8],
[4,'2004-03-10 21:00:00',4.45,7]],
columns = ['idx','timestamp','float','int'])
print(df.head())
#converting 'timestamp' string to datetime
df["Timestamp"] = pd.to_datetime(df["timestamp"])
#data type after conversion
print("After conversion: ",type(df["Timestamp"].loc[0]))
#latest timestamp
print(df["Timestamp"].max())
#earliest timestamp
print(df["Timestamp"].min())
#difference between timestamps
print(df["Timestamp"].max() - df["Timestamp"].min())
#extract year into new column
df["Year"] = df["Timestamp"].dt.year
#extract quarter into new column
df["Quarter"] = df["Timestamp"].dt.quarter
#extract month into new column
df["Month"] = df["Timestamp"].dt.month
#extract week number into new column (not valid for Pandas V2)
#df["Week Number"] = df["Timestamp"].dt.week
print(df[["Timestamp","Year","Quarter","Month"]].head())
#Convert the date object values to a new column of type integer
df['timestamp_ns'] = df['Timestamp'].astype('int64')
#Create a column with the difference in the timestamp values
df['diff'] = df['Timestamp'].diff()
# NOTE: The index examples below assume the DataFrame has a sorted DatetimeIndex
# (e.g. after df = df.set_index('Timestamp').sort_index()).
# Determine if the index has freq set (or if it is None)
if pd.to_datetime(df.index).freq is None:
print("Index freq is None")
else:
print("Index freq is ", pd.to_datetime(df.index).freq)
# Determine the most likely frequency / period for the index (without using the freq attribute)
# Returns string values for frequency: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
if isinstance(df.index, pd.DatetimeIndex): print("pd.infer_freq(df.index): ", pd.infer_freq(df.index))
# Determine if the index is of type datetime (pd.DatetimeIndex) or numeric
if isinstance(df.index, pd.DatetimeIndex):
print("Dataframe index is of type datetime")
else:
print("Dataframe index is of type numeric")
# If the index is of type datetime, determine and set the index frequency if it is None
if isinstance(df.index, pd.DatetimeIndex) and pd.to_datetime(df.index).freq is None:
print("df.index.freq before: ", df.index.freq)
# The method .infer_freq() will determine the most likely frequency from the index,
# but it will fail if the index is not sorted or has gaps.
# Set the frequency to hour "H" (the data must already have a frequency of hour)
# Note that the datetime index is NOT converted to the specified frequency/period.
# String values for frequency: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
#df.index.freq = "H" # manual manipulation
df.index.freq = pd.infer_freq(df.index)
# alternative methods:
#df = df.asfreq(pd.infer_freq(df.index))
#df.index = df.index.to_period(pd.infer_freq(df.index))
print("df.index.freq after: ", df.index.freq)
# Convert the datetime index to a particular frequency
# String values for frequency: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
df = df.asfreq("min")
print(df)
print("df.index.freq: ", df.index.freq)
import pandas as pd
# Create dataframe with a datetime64[ns] index of freq nanoseconds
print("\nCreate dataframe with a datetime64[ns] index of freq nanoseconds")
datetime_series = pd.Series(pd.date_range("2000-01-01", periods=4, freq="ns")) # freq="ns" or "N"
df = pd.DataFrame([[1.23, 4.45, 10],
[3.38, 3.34, 9],
[2.34, 2.34, 8],
[4.45, 1.23, 7]],
columns = ['float_rnd', 'float_lin', 'int'])
df.set_index(datetime_series, inplace=True)
df.index.rename("Timestamp", inplace=True)
Categorical fields are variables that contain a limited number of categories. One-hot encoding converts the categories into numerical values that can be used by a machine learning (ML) model. Pandas example below.
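A minimal pandas sketch of one-hot encoding with pd.get_dummies() (also demonstrated later in this tutorial):
import pandas as pd
df = pd.DataFrame({'cat': ['A', 'B', 'A', 'C']})
# One column per category, with 1/0 (or True/False, depending on the pandas version)
df = pd.get_dummies(df, columns=['cat'], prefix='cat')
print(df)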
Scale the values of a numeric column so that they have zero mean and unit variance. An ML model will be sensitive to wide variations in scale among the numeric columns. Frequently they will be normalized to a floating point value between 0 and 1. Pandas example below.
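A minimal pandas sketch of standardization (zero mean, unit variance):
import pandas as pd
df = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0]})
# Subtract the mean and divide by the (sample) standard deviation
df['x_std'] = (df['x'] - df['x'].mean()) / df['x'].std()
print(df)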
Allocate the values of a continuous numeric column into bins, converting them into a categorical column.
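For example, with pd.cut() (the bin edges and labels below are arbitrary):
import pandas as pd
df = pd.DataFrame({'age': [5, 17, 32, 45, 68]})
# Allocate the continuous 'age' values into labeled bins, producing a categorical column
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 40, 65, 120],
                       labels=['child', 'young adult', 'adult', 'senior'])
print(df)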
Scale the values of a numeric column so that they have a minimum value of 0 and a maximum value of 1. Pandas example.
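A minimal pandas sketch of min/max scaling:
import pandas as pd
df = pd.DataFrame({'x': [1.23, 2.34, 3.45, 4.56]})
# Rescale so the minimum maps to 0 and the maximum maps to 1
df['x_minmax'] = (df['x'] - df['x'].min()) / (df['x'].max() - df['x'].min())
print(df)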
When the data contains outliers, scale the values of a numeric column using the median and interquartile range to minimize the outlier influence.
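A small sketch using the median and interquartile range directly in pandas (scikit-learn's RobustScaler performs the same kind of scaling):
import pandas as pd
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 100.0]})   # 100.0 is an outlier
iqr = df['x'].quantile(0.75) - df['x'].quantile(0.25)
# Center on the median and scale by the interquartile range to reduce the outlier's influence
df['x_robust'] = (df['x'] - df['x'].median()) / iqr
print(df)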
Used for correcting the skewness of a distribution, as skewed distributions can cause problems when fitting certain types of models.
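A simple log transform is one common correction for right skew; scikit-learn's PowerTransformer (Yeo-Johnson or Box-Cox) is a more general option. A minimal sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1.0, 2.0, 4.0, 8.0, 64.0]})   # right-skewed values
df['x_log'] = np.log1p(df['x'])   # log(1 + x) compresses the long right tail and handles zeros
print(df)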
Transform the values of a numeric column so that they have a uniform or normal distribution. Useful for improving the assumptions of certain ML models that assume the predictor variables are normally distributed.
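A minimal sketch using scikit-learn's QuantileTransformer (assumes scikit-learn is installed):
# pip install scikit-learn
import pandas as pd
from sklearn.preprocessing import QuantileTransformer
df = pd.DataFrame({'x': [1.0, 2.0, 4.0, 8.0, 64.0, 128.0, 256.0, 512.0]})
# Map the values to an approximately normal distribution; n_quantiles must not exceed the sample count
qt = QuantileTransformer(output_distribution='normal', n_quantiles=len(df))
df['x_qt'] = qt.fit_transform(df[['x']]).ravel()
print(df)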
Transform the values of a numeric column so that they are approximately normally distributed. Useful for improving the assumptions of certain ML models that assume the predictor variables are normally distributed.
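A minimal SciPy sketch (Box-Cox requires strictly positive values; the same call appears again in the EDA section below):
import pandas as pd
from scipy import stats
df = pd.DataFrame({'x': [1.0, 2.0, 4.0, 8.0, 64.0]})
# Returns the transformed values and the fitted lambda parameter
transformed, fitted_lambda = stats.boxcox(df['x'])
print(transformed, fitted_lambda)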
Distribution | Correlation | Statistics | Scaling | Categorical Values |
Summarize the main characteristics of a data set, identifying patterns in the data, the existence of outliers, and features of the data that are unexpected.
# Import the libraries used in the EDA examples below
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Create a simple Pandas DataFrame
df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,np.nan,float('NaN'),None],[4,'D',4.45,7]], columns = ['idx','cat','float','int'])
# Visualize the distribution of the numeric column 'float' to make sure the values are normally distributed
sns.histplot(df['float']) # or plt.hist(df['float'])
plt.show()
# Check visually if numeric values are normally distributed using a Q-Q plot.
# (quantiles of two variables are plotted against each other)
# Tutorial: https://towardsdatascience.com/significance-of-q-q-plots-6f0c6e31c626
stats.probplot(df['float'], plot=plt)
plt.show()
# Visualize the distribution of a categorical column
df['cat'].value_counts().plot(kind='bar')
plt.show()
# Create a displot to visualize the distribution of a numeric column (better than a KDE plot alone)
#sns.distplot(df['float']) # deprecated; replaced by `displot` (a figure-level function with similar flexibility)
sns.displot(df['float'])
plt.show()
# Create a KDE plot to visualize the kernel density estimate of a numeric column
sns.kdeplot(df['float'])
plt.show()
# create a box plot to visualize the distribution of a numeric column
plt.boxplot(df['float'])
plt.ylabel('float')
plt.show()
# Create a box plot to visualize the distribution of a numeric column by the categories of a categorical column.
sns.boxplot(x='cat', y='int', data=df)
plt.ylabel('int')
plt.show()
# Create a boxenplot to visualize the distribution of a numeric column by the categories of a categorical column
sns.boxenplot(x='cat', y='float', data=df)
plt.ylabel('float')
plt.show()
# Create a violin plot to visualize the distribution of a numeric column by the categories of a categorical column
sns.violinplot(x='cat', y='int', data=df)
plt.ylabel('int')
plt.show()
# Create a swarm plot to visualize the distribution of a numeric column by the categories of a categorical column
sns.swarmplot(x='cat', y='float', data=df)
plt.ylabel('float')
plt.show()
# Create a faceting grid to visualize the distribution of multiple numeric columns by the categories of a categorical column
g = sns.FacetGrid(df, col='cat')
g.map(plt.hist, 'int')
plt.show()
# Create a rugplot to visualize the distribution of a numeric column and the individual observations
# (rugplot adds ticks on axis to show the location of individual observations)
sns.set_theme()
sns.kdeplot(data=df, x='float')
sns.rugplot(data=df, x='float')
plt.show()
# Create a heatmap to visualize the correlation between multiple numeric columns
sns.heatmap(df[['float','int']].corr(), cmap='RdYlGn', annot=True)
plt.show()
# Create a scatter plot to visualize the relationship between two numeric columns
plt.scatter(df['float'], df['int'])
plt.xlabel('float')
plt.ylabel('int')
plt.show()
# create a pairplot to visualize the relationships between multiple numeric columns
sns.pairplot(df, vars=['float', 'int'])
plt.show()
#Create a jointplot to visualize the relationship between two numeric columns and their distributions
sns.jointplot(x='float', y='int', data=df)
plt.show()
# Create a lag plot to check for autocorrelation in a numeric column.
# (autocorrelation is the degree of correlation of a variables values over time)
from pandas.plotting import lag_plot
lag_plot(df['float'])
plt.title('Lag Plot of Autocorrelation of float')
plt.show()
# Create an autocorrelation plot to visualize the autocorrelation in a numeric column
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(df['float'])
plt.title('Autocorrelation of float')
plt.show()
# Create a bar plot to visualize the count of a categorical column
df['cat'].value_counts().plot(kind='bar')
plt.ylabel('Count')
plt.show()
# Create a bar plot to visualize the mean of a numeric column for each category of a categorical column
df.groupby('cat')['float'].mean().plot(kind='bar')
plt.ylabel('float average by cat')
plt.title('Mean of float by cat')
plt.show()
# Create a pivot table to summarize the data in terms of the mean
pivot_table = df.pivot_table(index='cat', columns='float', values='int', aggfunc='mean')
print(pivot_table)
"""
float 1.230 2.340 3.395 4.450
cat
A 10.0 NaN 8.0 NaN
B NaN 9.0 NaN NaN
D NaN NaN NaN 7.0
"""
# Create a heatmap to visualize the contents of a pivot table
plt.pcolor(pivot_table, cmap='Reds')
plt.colorbar()
plt.title('Pivot Table')
plt.show()
# Create a point plot to visualize the mean of a numeric column by the categories of a categorical column
sns.pointplot(x='cat', y='float', data=df)
plt.ylabel('Average float')
plt.show()
# Create a pointplot to visualize the mean and confidence interval of a numeric column by the categories of a categorical column
sns.pointplot(x='cat', y='float', data=df, errorbar=('ci', 95))
plt.ylabel('Average float')
plt.title('PointPlot to Visualize Mean & Confidence Interval')
plt.show()
# Create a scatter plot matrix to visualize the relationships between multiple numeric columns
from pandas.plotting import scatter_matrix
scatter_matrix(df[['int', 'float']], alpha=0.2, figsize=(6, 6))
plt.show()
# Create a regression plot to visualize the relationship between two numeric columns
sns.regplot(x='int', y='float', data=df)
plt.show()
# Create a barplot to visualize the mean of a numeric column by the categories of a categorical column
sns.barplot(x='cat', y='int', data=df)
plt.ylabel('Average of int')
plt.show()
# Create a lmplot to visualize the relationship between two numeric columns and the categories of a categorical column
sns.lmplot(x='int', y='float', hue='cat', data=df)
plt.show()
# Scaling can improve the convergence speed of various ML algorithms, especially with data sets that have a large variation.
# The Normalization method scales values to between 0 and 1.
# The Standardization method rescales values to have a mean of 0 and a standard deviation of 1.
# If the distribution is normal, then it should be standardized, otherwise normalized.
# https://medium.com/mlearning-ai/feature-scaling-normalization-or-standardization-74d73ec90366
# Min-Max Normalization:
# Formula: df['column'] = df['column'] - df['column'].min()
# Formula: df['column'] = df['column'] / df['column'].max()
df['float'] = df['float'] - df['float'].min()
df['float'] = df['float'] / df['float'].max()
print(df['float'].head())
"""
0 0.00000
1 0.34472
2 0.67236
3 1.00000
"""
# Standardization:
# Formula: df['column'] = ( df['column'] - df['column'].mean() ) / df['column'].std()
df['float'] = (df['float'] - df['float'].mean() ) / df['float'].std()
print(df['float'].head())
"""
0 -1.173730
1 -0.371365
2 0.391243
3 1.153852
"""
# Apply a box-cox transformation to a numeric column
# NOTE: stats.boxcox() requires strictly positive values, so apply it to the original
# (unscaled) 'float' column rather than to the standardized values computed above.
from scipy import stats
# transform data & save lambda value
fitted_data, fitted_lambda = stats.boxcox(df['float'])
print(fitted_data)
print(f"Lambda value used for Transformation: {fitted_lambda}")
"""
[0.22291806 1.16527003 1.94135308 2.64853539]
Lambda value used for Transformation: 0.7064823317828955
"""
This code automatically scales the data using Standardization or Normalization after examining the data to determine if it is normally distributed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def scale_df(df):
"""
If the Series df is Gaussian (normally distributed), it is
standardized (rescaled to a mean of 0 and a standard deviation of 1).
If the Series df is NOT Gaussian but tests as a log-normal
distribution, it is also standardized.
Otherwise the Series is normalized (min-max scaled to values
between 0 and 1).
"""
# Scaling can improve the convergence speed of various ML algorithms, especially with data sets that have a large variation.
# The Normalization method scales values to between 0 and 1.
# The Standardization method rescales values to have a mean of 0 and a standard deviation of 1.
# If the distribution is normal, then it should be standardized, otherwise normalized.
# Normalization is a better option when we are not necessarily dealing with normally distributed
# data and you are not worried about standardization along the variance axes (i.e. image processing
# or a neural network expecting values between 0 and 1).
# https://medium.com/mlearning-ai/feature-scaling-normalization-or-standardization-74d73ec90366
from scipy.stats import shapiro
# Check numerically if the numeric column is normally distributed using the Shapiro-Wilk test
stat, p1 = shapiro(df)
# Check again using D'Agostino's K-squared test (normaltest), which requires at least
# 8 observations; for smaller samples fall back to the Shapiro-Wilk p-value.
from scipy.stats import normaltest
stat, p2 = normaltest(df) if len(df) >= 8 else (stat, p1)
# Check numerically against a normal distribution using the Kolmogorov-Smirnov test
# (used here to help identify a log-normal distribution)
from scipy.stats import kstest
#perform Kolmogorov-Smirnov test for normality
stat, pvalue = kstest(df, 'norm')
if p1 > 0.05 and p2 > 0.05 and pvalue <= 0.05:
print('Series "', df.name,'" is probably Gaussian (normally distributed) and will be scaled using Standardization.\n')
df = (df - df.mean() ) / df.std()
elif pvalue > 0.05:
print('Series "', df.name, '" is a Log-Normal distribution and will be scaled using Standardization')
df = (df - df.mean() ) / df.std()
else:
print('Series "', df.name,'" is NOT Gaussian (normally distributed) or Log-Normal distribution, so it will be scaled using Normalization (min-max scaling).\n')
print('stat=%.3f, p1=%0.3f' % (stat,p1))
print('stat=%.3f, p2=%0.3f' % (stat,p2))
print('stat=%.3f, pvalue=%0.3f' % (stat,pvalue))
# Min-Max Normalization:
# Formula: df['column'] = df['column'] - df['column'].min()
# Formula: df['column'] = df['column'] / df['column'].max()
df = df - df.min()
df = df / df.max()
return df
df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,'C',3.45,8],[4,'D',4.56,7]], columns = ['idx','cat','float','int'])
print('\n', df.head(), '\n')
"""
idx cat float int
0 1 A 1.23 10
1 2 B 2.34 9
2 3 C 3.45 8
3 4 D 4.56 7
"""
df['float'] = scale_df(df['float'])
print('\n', df.head(), '\n')
# Encoding categorical variables:
# https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd
# Encode categorical variables using one-hot encoding
"""
print(df.head())
df = pd.get_dummies(df, columns=['cat'], prefix=['df'])
print(df.head())
idx float int df_A df_B df_D
0 1 1.230 10 1 0 0
1 2 2.340 9 0 1 0
2 3 3.395 8 1 0 0
3 4 4.450 7 0 0 1
"""
# Encode categorical variables using ordinal encoding
"""
print(df.head())
df['cat_numerical'] = pd.factorize(df['cat'])[0] + 1
#df['cat'] = pd.factorize(df['cat'])[0] + 1
print(df.head())
idx cat float int cat_numerical
0 1 A 1.230 10 1
1 2 B 2.340 9 2
2 3 A 3.395 8 1
3 4 D 4.450 7 3
"""
# Encode categorical variables using label encoding
# (retains the natural order, but it has the disadvantage that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy/order in them)
"""
# Convert 'cat' from data type string to category
df['cat'] = df['cat'].astype('category')
df['cat_enc'] = df['cat'].cat.codes
print(df.head())
"""
"""
idx cat float int cat_enc
0 1 A 1.230 10 0
1 2 B 2.340 9 1
2 3 A 3.395 8 0
3 4 D 4.450 7 2
"""
Intro to Descriptive Statistics
How to Find the Best Theoretical Distribution for Your Data
Extract, load, transform (ELT) is an alternative to extract, transform, load (ETL) used with data lake implementations. In contrast to ETL, in ELT models the data is not transformed on entry to the data lake but is stored in its original raw format, which enables faster loading. After loading, the data is transformed into a usable and trusted resource, and can then be accessed by end users to answer business-related questions.
dbt handles the T (transform) in ETL/ELT processes, converting SELECT statements into tables and views.
Data orchestration is the process of taking siloed data from multiple data storage locations, combining and organizing it, and making it available for data analysis tools.
NumPy is a library for numerical mathematics. It is faster than native Python because many of its operations are implemented in C and use optimized algorithms.
Matplotlib is a plotting library for Python and its numerical mathematics extension, NumPy.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
SciPy is an open source library for scientific computing.
Function Optimization With SciPy
EDA: Exploratory Data Analysis Cheatsheet (highly recommended - start with this one first)
Scientific Functions in NumPy and SciPy
Churn Modeling: A detailed step-by-step Guide in Python
EDA: IKEA Product Analysis and Price Prediction Using Linear Regression
Dashboards for Data Analysis and Data Science
Literally Everything You Can Do With Time Series! (yahoofinancials Python wrapper around the Yahoo Finance API)
Python Solutions