Introduction to Databases using SQLAlchemy
Relational Databases
Data about entities is organised into tables.
Each row or record is an instance of an entity.
Each column has information about an attribute.
Tables can be linked together using unique keys.
Databases support more data, multiple simultaneous users, and data quality controls.
Data types are specified for each column.
SQL is used to interact with databases, e.g. Microsoft SQL Server, Oracle, PostgreSQL, and SQLite.
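The points above (typed columns, unique keys linking tables, data quality controls) can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 module rather than SQLAlchemy so that it runs without a database file, and the boroughs and calls tables, their columns, and their rows are hypothetical.

```python
import sqlite3

# In-memory database; nothing is written to disk
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on
conn.execute("PRAGMA foreign_keys = ON")

# Each column declares a data type; the primary key uniquely identifies a row
conn.execute("""
    CREATE TABLE boroughs (
        borough_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE calls (
        call_id        INTEGER PRIMARY KEY,
        complaint_type TEXT NOT NULL,
        borough_id     INTEGER NOT NULL REFERENCES boroughs (borough_id)
    )
""")

# Hypothetical sample rows
conn.executemany("INSERT INTO boroughs VALUES (?, ?)",
                 [(1, "BRONX"), (2, "QUEENS")])
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)",
                 [(10, "SAFETY", 1), (11, "WATER LEAK", 2), (12, "SAFETY", 2)])

# The shared key links each call to its borough
linked = conn.execute("""
    SELECT calls.call_id, calls.complaint_type, boroughs.name
    FROM calls
    JOIN boroughs ON calls.borough_id = boroughs.borough_id
    ORDER BY calls.call_id
""").fetchall()

# A data quality control: a call that points at a missing borough is rejected
try:
    conn.execute("INSERT INTO calls VALUES (13, 'SAFETY', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Here each row of calls is one record, borough_id is the key that ties the two tables together, and the failed insert shows the database refusing inconsistent data.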
Connecting to Databases
Steps:
Create a database engine
Query the database
sqlalchemy's create_engine() makes an engine to handle database connections.
It needs the URL of the database to connect to, given as a string.
SQLite URL format: sqlite:///filename.db
pd.read_sql(query, engine) loads data from a database. Arguments:
query: string containing the SQL query to run or the name of the table to load.
engine: connection or database engine object.
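The two steps can be tried end to end without an existing database file. The sketch below assumes pandas and SQLAlchemy are installed, uses an in-memory SQLite database instead of data.db, and fills a weather table with made-up rows purely for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# Step 1: create the engine (in-memory SQLite, used only to keep this self-contained)
engine = create_engine("sqlite:///:memory:")

# Hypothetical sample rows standing in for a real weather table
sample = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-02"],
    "tmax": [19, 26],
    "tmin": [7, 13],
})
sample.to_sql("weather", engine, index=False)

# Step 2: query the database.
# The query argument can be a SQL statement...
from_query = pd.read_sql("SELECT date, tmax FROM weather", engine)

# ...or just a table name, which loads the whole table
from_table = pd.read_sql("weather", engine)
```

Passing a table name is a shortcut for SELECT * on that table; a full SQL statement is needed as soon as columns or rows have to be narrowed down.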
Getting data from a database
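# Import sqlalchemy's create_engine() and inspect() functions
from sqlalchemy import create_engine, inspect
# Create the database engine
engine = create_engine('sqlite:///data.db')
# View the tables in the database
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector replaces it)
print(inspect(engine).get_table_names())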
Load entire tables
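# Import pandas and sqlalchemy's create_engine() function
import pandas as pd
from sqlalchemy import create_engine
# Create the database engine
engine = create_engine("sqlite:///data.db")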
# Create a SQL query to load the entire weather table
query = """
SELECT *
FROM weather;
"""
# Load weather with the SQL query
weather = pd.read_sql(query, engine)
# View the first few rows of data
print(weather.head())
Selecting columns with SQL
# Create database engine for data.db
engine = create_engine("sqlite:///data.db")
# Write query to get date, tmax, and tmin from weather
query = """
SELECT date,
tmax,
tmin
FROM weather;
"""
# Make a dataframe by passing query and engine to read_sql()
temperatures = pd.read_sql(query, engine)
# View the resulting dataframe
print(temperatures)
Selecting rows
# Import matplotlib's pyplot module for the plot below
import matplotlib.pyplot as plt
# Create query to get hpd311calls records about safety
query = """
SELECT *
FROM hpd311calls
WHERE complaint_type = 'SAFETY';
"""
# Query the database and assign result to safety_calls
safety_calls = pd.read_sql(query, engine)
# Graph the number of safety calls by borough
call_counts = safety_calls.groupby('borough').unique_key.count()
call_counts.plot.barh()
plt.show()
Filtering on multiple conditions
# Create query for records with max temps <= 32 or snow >= 1
query = """
SELECT *
FROM weather
WHERE tmax <= 32
OR snow >= 1;
"""
# Query database and assign result to wintry_days
wintry_days = pd.read_sql(query, engine)
# View summary stats about the temperatures
print(wintry_days.describe())
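To make the effect of OR concrete, here is a small sketch that contrasts it with AND on the same two conditions. It uses Python's built-in sqlite3 module and an in-memory weather table with invented rows, so the values are assumptions and not taken from data.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER, snow REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("2018-01-01", 19, 0.0),  # cold, no snow
        ("2018-01-02", 40, 2.5),  # mild, snowy
        ("2018-01-03", 30, 1.0),  # cold and snowy
        ("2018-01-04", 45, 0.0),  # mild, no snow
    ],
)

# OR keeps a row when at least one condition is true
either = conn.execute(
    "SELECT date FROM weather WHERE tmax <= 32 OR snow >= 1 ORDER BY date"
).fetchall()

# AND keeps a row only when both conditions are true
both = conn.execute(
    "SELECT date FROM weather WHERE tmax <= 32 AND snow >= 1 ORDER BY date"
).fetchall()
```

With these rows, OR returns the first three dates while AND returns only 2018-01-03, the one day that is both cold and snowy.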
Counting in groups
# Create query to get call counts by complaint_type
query = """
SELECT complaint_type,
COUNT(*)
FROM hpd311calls
GROUP BY complaint_type;
"""
# Create dataframe of call counts by issue
calls_by_issue = pd.read_sql(query, engine)
# Graph the number of calls for each housing issue
calls_by_issue.plot.barh(x="complaint_type")
plt.show()
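An unaliased COUNT(*) leaves the name of the count column up to the database. One option, sketched here with the built-in sqlite3 module and a few invented hpd311calls rows, is to name it with AS; the alias call_count is a hypothetical choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (unique_key INTEGER, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [(1, "SAFETY"), (2, "WATER LEAK"), (3, "SAFETY"), (4, "PLUMBING")],
)

# AS gives the aggregate a predictable column name that can also be sorted on
cursor = conn.execute("""
    SELECT complaint_type,
           COUNT(*) AS call_count
    FROM hpd311calls
    GROUP BY complaint_type
    ORDER BY call_count DESC, complaint_type
""")
column_names = [col[0] for col in cursor.description]
counts = cursor.fetchall()
```

The same aliased query passed to pd.read_sql() would yield a call_count column, which is easier to reference when plotting or sorting the dataframe.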
Working with aggregate functions
# Create query to get temperature and precipitation by month
query = """
SELECT month,
MAX(tmax),
MIN(tmin),
SUM(prcp)
FROM weather
GROUP BY month;
"""
# Get dataframe of monthly weather stats
weather_by_month = pd.read_sql(query, engine)
# View weather stats by month
print(weather_by_month)
Joining tables
# Query to join weather to call records by date columns
query = """
SELECT *
FROM hpd311calls
JOIN weather
ON hpd311calls.created_date = weather.date;
"""
# Create dataframe of joined tables
calls_with_weather = pd.read_sql(query, engine)
# View the dataframe to make sure all columns were joined
print(calls_with_weather.head())
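A plain JOIN is an inner join, so it keeps only rows whose key values appear in both tables. The sketch below illustrates this with the built-in sqlite3 module; the table layouts are cut down to a few columns and every row is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (unique_key INTEGER, created_date TEXT)")
conn.execute("CREATE TABLE weather (date TEXT, prcp REAL)")

# Call 3 was made on a date that has no weather record
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [(1, "2018-01-01"), (2, "2018-01-01"), (3, "2018-01-05")],
)
# 2018-01-02 has weather data but no calls
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2018-01-01", 0.0), ("2018-01-02", 0.3)],
)

joined = conn.execute("""
    SELECT hpd311calls.unique_key, weather.date, weather.prcp
    FROM hpd311calls
    JOIN weather
      ON hpd311calls.created_date = weather.date
    ORDER BY hpd311calls.unique_key
""").fetchall()
```

Only calls 1 and 2 survive the join: call 3 and the weather row for 2018-01-02 have no match on the other side, so neither appears in the result.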
Joining and filtering
# Query to get hpd311calls and precipitation values
query = """
SELECT hpd311calls.*, weather.prcp
FROM hpd311calls
JOIN weather
ON hpd311calls.created_date = weather.date;
"""
# Load query results into the leak_calls dataframe
leak_calls = pd.read_sql(query, engine)
# View the dataframe
print(leak_calls.head())
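Then filter the joined records to water leak complaints:
# Query to get water leak calls and precipitation values
query = """
SELECT hpd311calls.*, weather.prcp
FROM hpd311calls
JOIN weather
ON hpd311calls.created_date = weather.date
WHERE hpd311calls.complaint_type = 'WATER LEAK';
"""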
# Load query results into the leak_calls dataframe
leak_calls = pd.read_sql(query, engine)
# View the dataframe
print(leak_calls.head())