SQL Mastery with Python, Pandas, and Colab: A Data Analyst's Guide
Introduction
In this blog, we dive into the fascinating world of data manipulation and analysis through the lens of Python, Pandas, SQL, and Google Colab. Our hands-on code samples serve as your gateway to understanding the intricate interplay between these powerful tools. From establishing database connections and querying data with SQL to unleashing the capabilities of Pandas for data manipulation, we’ll walk you through step-by-step examples that demystify the data analysis process. Whether you’re looking to sharpen your SQL skills, supercharge your Pandas expertise, or harness the collaborative power of Google Colab, our code snippets and tutorials provide the insights you need to excel in the world of data-driven decision-making.
Python Notebook
Import the 'sqlite3' and 'pandas' libraries to your Python script.
import sqlite3
import pandas as pd
Establish a connection to a SQLite database named 'data.db' located at the specified path '/content/data.db' and store it in the 'conn' variable.
conn = sqlite3.connect('/content/data.db')
Create a custom cell magic function named 'sql' using the 'register_cell_magic' decorator from the 'IPython.core.magic' library. Inside this function, attempt to execute the SQL query provided in the cell using the 'conn' connection to the SQLite database. If successful, return the query result as a Pandas DataFrame. If an error occurs during execution, print 'error'.
from IPython.core.magic import (register_line_magic, register_cell_magic)
@register_cell_magic
def sql(line, cell):
try:
result=conn.execute(cell)
return pd.DataFrame(result.fetchall())
except:
print('error')
Execute an SQL command to create a new table named 'students' in the SQLite database connected through 'conn'. This table will have three columns: 'id' as an INTEGER primary key, 'name' as TEXT, and 'age' as INTEGER.
conn.execute('''
CREATE TABLE students (
id INTEGER PRIMARY KEY,
name TEXT,
age INTEGER
)
''')
<sqlite3.Cursor at 0x7dcb43867c40>
Insert data into the 'students' table in the connected SQLite database. Three records are added: 'John' with an age of 25, 'Alice' with an age of 22, and 'Bob' with an age of 23.
conn.execute("INSERT INTO students (name, age) VALUES ('John', 25)")
conn.execute("INSERT INTO students (name, age) VALUES ('Alice', 22)")
conn.execute("INSERT INTO students (name, age) VALUES ('Bob', 23)")
<sqlite3.Cursor at 0x7dcb438e4540>
Execute an SQL query using the custom 'sql' cell magic function. Retrieve all records from the 'students' table and display the result as a DataFrame.
%%sql
SELECT * FROM students
0 | 1 | 2 | |
---|---|---|---|
0 | 1 | John | 25 |
1 | 2 | Alice | 22 |
2 | 3 | Bob | 23 |
Execute an SQL query to retrieve all records from the 'students' table using the 'conn' connection, store the result in the 'result' variable, and then create a Pandas DataFrame from the fetched data to display the table contents.
result = conn.execute("SELECT * FROM students")
pd.DataFrame(result.fetchall())
0 | 1 | 2 | |
---|---|---|---|
0 | 1 | John | 25 |
1 | 2 | Alice | 22 |
2 | 3 | Bob | 23 |
Execute an SQL query to retrieve all records from the 'students' table using the 'conn' connection. Then, iterate through the query result and print each row.
result = conn.execute("SELECT * FROM students")
for row in result:
print(row)
(1, 'John', 25) (2, 'Alice', 22) (3, 'Bob', 23)
Update the 'students' table in the SQLite database. Set the age to 55 for the student with the name 'John'. The fetchall()
function is called after the update, but it won't return any rows since it's an UPDATE statement.
conn.execute("UPDATE students SET age = 55 WHERE name = 'John'").fetchall()
[]
Execute an SQL DELETE statement using the 'sql' cell magic function to remove records from the 'students' table where the name is 'Alice'.
%%sql
DELETE FROM students WHERE name = 'Alice'
Define a multi-line SQL query stored in the variable 'z'. This query inserts two new records into the 'students' table. The first record has the name 'Ron' and age '77', while the second record has the name 'Sue' and age '65'.
z='''
INSERT INTO students(name,age)
VALUES('Ron',77), ('Sue',65);
'''
conn.execute(z)
<sqlite3.Cursor at 0x7dcb438e5ac0>
result = conn.execute("SELECT * FROM students")
for row in result:
print(row)
(1, 'John', 55) (3, 'Bob', 23) (4, 'Ron', 77) (5, 'Sue', 65)
conn.commit()
conn.close()
conn = sqlite3.connect('/content/data.db')
result = conn.execute("SELECT * FROM students")
for row in result:
print(row)
(1, 'John', 55) (3, 'Bob', 23) (4, 'Ron', 77) (5, 'Sue', 65)
Download a file from the specified URL 'https://data.sfgov.org/api/views/yitu-d5am/rows.csv?accessType=DOWNLOAD' and save it as 'FilmLocations.csv' using the 'gdown' library. The 'quiet' parameter is set to suppress download progress information.
url='https://data.sfgov.org/api/views/yitu-d5am/rows.csv?accessType=DOWNLOAD'
import gdown
gdown.download(url,'FilmLocations.csv',quiet=True)
'FilmLocations.csv'
Read the data from the 'FilmLocations.csv' file into a Pandas DataFrame named 'df_film' and display the first few rows using the 'head()' function.
import pandas as pd
df_film=pd.read_csv('FilmLocations.csv')
df_film.head()
Title | Release Year | Locations | Fun Facts | Production Company | Distributor | Director | Writer | Actor 1 | Actor 2 | Actor 3 | SF Find Neighborhoods | Analysis Neighborhoods | Current Supervisor Districts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Experiment in Terror | 1962 | The Sea Captain's Chest (Fisherman's Wharf) | NaN | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 99.0 | 23.0 | 3.0 |
1 | Experiment in Terror | 1962 | 100 St. Germain Avenue | NaN | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 47.0 | 38.0 | 8.0 |
2 | Chan is Missing | 1982 | Li Po (916 Grant Avenue at Washington, Chinatown) | NaN | New Yorker Films | New Yorker Films | Wayne Wang | Wayne Wang | Wood Moy | Marc Hayashi | Lauren Chew | 104.0 | 6.0 | 3.0 |
3 | A View to a Kill | 1985 | Taylor and Jefferson Streets (Fisherman's Wharf) | NaN | Metro-Goldwyn Mayer | MGM/UA Entertainment Company | John Glen | Richard Maibaum | Roger Moore | Christopher Walken | Tanya Roberts | 99.0 | 23.0 | 3.0 |
4 | The Californians | 2005 | NaN | NaN | Parker Film Company | Fabrication Films | Jonathan Parker | Jonathan Parker & Catherine DiNapoli | Noah Wyle | NaN | NaN | 21.0 | 36.0 | 10.0 |
import pandas as pd
import sqlite3
Establish a connection to a SQLite database named 'example.db' and store it in the 'conn' variable.
conn = sqlite3.connect('example.db')
Create a new table named 'FilmLocations' in the SQLite database connected through 'conn' and populate it with the data from the 'df_film' DataFrame.
df_film.to_sql('FilmLocations', con=conn)
2084
Retrieve data from the 'FilmLocations' table in the SQLite database connected through 'conn'. Select records where the 'Release Year' is 1962, the 'Distributor' is 'Columbia Pictures', and the 'Fun Facts' column is NULL. Limit the result to the first 3 matching records and return them as a Pandas DataFrame.
pd.read_sql_query("SELECT * FROM FilmLocations WHERE [Release Year]=1962 AND Distributor = 'Columbia Pictures' AND [Fun Facts] is NULL LIMIT 3", con=conn)
index | Title | Release Year | Locations | Fun Facts | Production Company | Distributor | Director | Writer | Actor 1 | Actor 2 | Actor 3 | SF Find Neighborhoods | Analysis Neighborhoods | Current Supervisor Districts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Experiment in Terror | 1962 | The Sea Captain's Chest (Fisherman's Wharf) | None | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 99.0 | 23.0 | 3.0 |
1 | 1 | Experiment in Terror | 1962 | 100 St. Germain Avenue | None | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 47.0 | 38.0 | 8.0 |
2 | 421 | Experiment in Terror | 1962 | 1 Montgomery Street at Post | None | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 19.0 | 8.0 | 3.0 |
Execute an SQL query using the 'sql' cell magic function to retrieve data from the 'FilmLocations' table in the SQLite database. Select records where the 'Release Year' is 1962, the 'Distributor' is 'Columbia Pictures', and the 'Fun Facts' column is NULL. Limit the result to the first 3 matching records.
%%sql
SELECT * FROM FilmLocations WHERE [Release Year]=1962 AND Distributor = 'Columbia Pictures' AND [Fun Facts] is NULL LIMIT 3
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Experiment in Terror | 1962 | The Sea Captain's Chest (Fisherman's Wharf) | None | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 99.0 | 23.0 | 3.0 |
1 | 1 | Experiment in Terror | 1962 | 100 St. Germain Avenue | None | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 47.0 | 38.0 | 8.0 |
2 | 421 | Experiment in Terror | 1962 | 1 Montgomery Street at Post | None | Columbia Pictures Corporation | Columbia Pictures | Blake Edwards | The Gordons | Glenn Ford | Lee Remick | Stefanie Powers | 19.0 | 8.0 | 3.0 |
%%sql
--SELECT * FROM FilmLocations
--SELECT * FROM FilmLocations LIMIT 15 OFFSET 10;
--SELECT DISTINCT Title FROM FilmLocations LIMIT 50;
SELECT DISTINCT Title FROM FilmLocations WHERE [Release Year]=1915 LIMIT 10;
--SELECT DISTINCT Title FROM FilmLocations WHERE [ReleaseYear]=2015 LIMIT 3 OFFSET 5;
0 | |
---|---|
0 | A Jitney Elopement |
pd.read_sql_query("SELECT DISTINCT Title FROM FilmLocations WHERE [Release Year]=1915
10;", conn)
Title | |
---|---|
0 | A Jitney Elopement |
Conclusion
Concluding our exploration of Python, Pandas, SQL, and Google Colab reveals a world of data-driven possibilities. Armed with SQL, you now possess the ability to query and extract insights from vast datasets. Pandas empowers you with the tools to manipulate, analyze, and visualize data with ease. Google Colab provides a collaborative and flexible environment for your data projects. Together, these tools form a formidable toolkit for data analysts and scientists. The code snippets and tutorials provided aim to expand your knowledge and ignite your passion for the endless opportunities that data analysis brings. As you continue to explore, experiment, and innovate with these tools, you’ll witness your data-driven endeavors flourish. Thank you for joining this enlightening journey, and remember, the world of data analysis is at your fingertips.