SQL Mastery with Python, Pandas, and Colab: A Data Analyst’s Guide

Introduction

In this blog, we dive into the fascinating world of data manipulation and analysis through the lens of Python, Pandas, SQL, and Google Colab. Our hands-on code samples serve as your gateway to understanding the intricate interplay between these powerful tools. From establishing database connections and querying data with SQL to unleashing the capabilities of Pandas for data manipulation, we’ll walk you through step-by-step examples that demystify the data analysis process. Whether you’re looking to sharpen your SQL skills, supercharge your Pandas expertise, or harness the collaborative power of Google Colab, our code snippets and tutorials provide the insights you need to excel in the world of data-driven decision-making.

Python Notebook

Import the 'sqlite3' and 'pandas' libraries to your Python script.

In [1]:

import sqlite3
import pandas as pd

Establish a connection to a SQLite database named 'data.db' located at the specified path '/content/data.db' and store it in the 'conn' variable.

In [2]:

conn = sqlite3.connect('/content/data.db')

Create a custom cell magic function named 'sql' using the 'register_cell_magic' decorator from the 'IPython.core.magic' library. Inside this function, attempt to execute the SQL query provided in the cell using the 'conn' connection to the SQLite database. If successful, return the query result as a Pandas DataFrame. If an error occurs during execution, print 'error'.

In [3]:

from IPython.core.magic import (register_line_magic, register_cell_magic)
@register_cell_magic
def sql(line, cell):
    try:
        result=conn.execute(cell)
        return pd.DataFrame(result.fetchall())
    except:
        print('error')

Execute an SQL command to create a new table named 'students' in the SQLite database connected through 'conn'. This table will have three columns: 'id' as an INTEGER primary key, 'name' as TEXT, and 'age' as INTEGER.

In [4]:

conn.execute('''
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER
    )
''')

Out[4]:

<sqlite3.Cursor at 0x7dcb43867c40>

Insert data into the 'students' table in the connected SQLite database. Three records are added: 'John' with an age of 25, 'Alice' with an age of 22, and 'Bob' with an age of 23.

In [5]:

conn.execute("INSERT INTO students (name, age) VALUES ('John', 25)")
conn.execute("INSERT INTO students (name, age) VALUES ('Alice', 22)")
conn.execute("INSERT INTO students (name, age) VALUES ('Bob', 23)")

Out[5]:

<sqlite3.Cursor at 0x7dcb438e4540>

Execute an SQL query using the custom 'sql' cell magic function. Retrieve all records from the 'students' table and display the result as a DataFrame.

In [6]:

%%sql
SELECT * FROM students

Out[6]:

	0	1	2
0	1	John	25
1	2	Alice	22
2	3	Bob	23

Execute an SQL query to retrieve all records from the 'students' table using the 'conn' connection, store the result in the 'result' variable, and then create a Pandas DataFrame from the fetched data to display the table contents.

In [7]:

result = conn.execute("SELECT * FROM students")
pd.DataFrame(result.fetchall())

Out[7]:

	0	1	2
0	1	John	25
1	2	Alice	22
2	3	Bob	23

Execute an SQL query to retrieve all records from the 'students' table using the 'conn' connection. Then, iterate through the query result and print each row.

In [8]:

result = conn.execute("SELECT * FROM students")
for row in result:
    print(row)

(1, 'John', 25)
(2, 'Alice', 22)
(3, 'Bob', 23)

Update the 'students' table in the SQLite database. Set the age to 55 for the student with the name 'John'. The fetchall() function is called after the update, but it won't return any rows since it's an UPDATE statement.

In [9]:

conn.execute("UPDATE students SET age = 55 WHERE name = 'John'").fetchall()

Out[9]:

[]

Execute an SQL DELETE statement using the 'sql' cell magic function to remove records from the 'students' table where the name is 'Alice'.

In [10]:

%%sql
DELETE FROM students WHERE name = 'Alice'

Out[10]:

Define a multi-line SQL query stored in the variable 'z'. This query inserts two new records into the 'students' table. The first record has the name 'Ron' and age '77', while the second record has the name 'Sue' and age '65'.

In [11]:

z='''
INSERT INTO students(name,age)
VALUES('Ron',77), ('Sue',65);
'''

In [12]:

conn.execute(z)

Out[12]:

<sqlite3.Cursor at 0x7dcb438e5ac0>

In [13]:

result = conn.execute("SELECT * FROM students")
for row in result:
    print(row)

(1, 'John', 55)
(3, 'Bob', 23)
(4, 'Ron', 77)
(5, 'Sue', 65)

In [14]:

conn.commit()

In [15]:

conn.close()

In [16]:

conn = sqlite3.connect('/content/data.db')

In [17]:

result = conn.execute("SELECT * FROM students")

In [18]:

for row in result:
    print(row)

(1, 'John', 55)
(3, 'Bob', 23)
(4, 'Ron', 77)
(5, 'Sue', 65)

In [18]:

Download a file from the specified URL 'https://data.sfgov.org/api/views/yitu-d5am/rows.csv?accessType=DOWNLOAD' and save it as 'FilmLocations.csv' using the 'gdown' library. The 'quiet' parameter is set to suppress download progress information.

In [19]:

url='https://data.sfgov.org/api/views/yitu-d5am/rows.csv?accessType=DOWNLOAD'
import gdown
gdown.download(url,'FilmLocations.csv',quiet=True)

Out[19]:

'FilmLocations.csv'

Read the data from the 'FilmLocations.csv' file into a Pandas DataFrame named 'df_film' and display the first few rows using the 'head()' function.

In [20]:

import pandas as pd
df_film=pd.read_csv('FilmLocations.csv')
df_film.head()

Out[20]:

	Title	Release Year	Locations	Fun Facts	Production Company	Distributor	Director	Writer	Actor 1	Actor 2	Actor 3	SF Find Neighborhoods	Analysis Neighborhoods	Current Supervisor Districts
0	Experiment in Terror	1962	The Sea Captain's Chest (Fisherman's Wharf)	NaN	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	99.0	23.0	3.0
1	Experiment in Terror	1962	100 St. Germain Avenue	NaN	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	47.0	38.0	8.0
2	Chan is Missing	1982	Li Po (916 Grant Avenue at Washington, Chinatown)	NaN	New Yorker Films	New Yorker Films	Wayne Wang	Wayne Wang	Wood Moy	Marc Hayashi	Lauren Chew	104.0	6.0	3.0
3	A View to a Kill	1985	Taylor and Jefferson Streets (Fisherman's Wharf)	NaN	Metro-Goldwyn Mayer	MGM/UA Entertainment Company	John Glen	Richard Maibaum	Roger Moore	Christopher Walken	Tanya Roberts	99.0	23.0	3.0
4	The Californians	2005	NaN	NaN	Parker Film Company	Fabrication Films	Jonathan Parker	Jonathan Parker & Catherine DiNapoli	Noah Wyle	NaN	NaN	21.0	36.0	10.0

In [21]:

import pandas as pd
import sqlite3

Establish a connection to a SQLite database named 'example.db' and store it in the 'conn' variable.

In [22]:

conn = sqlite3.connect('example.db')

Create a new table named 'FilmLocations' in the SQLite database connected through 'conn' and populate it with the data from the 'df_film' DataFrame.

In [23]:

df_film.to_sql('FilmLocations', con=conn)

Out[23]:

Retrieve data from the 'FilmLocations' table in the SQLite database connected through 'conn'. Select records where the 'Release Year' is 1962, the 'Distributor' is 'Columbia Pictures', and the 'Fun Facts' column is NULL. Limit the result to the first 3 matching records and return them as a Pandas DataFrame.

In [24]:

pd.read_sql_query("SELECT * FROM FilmLocations WHERE [Release Year]=1962 AND Distributor = 'Columbia Pictures' AND [Fun Facts] is NULL LIMIT 3", con=conn)

Out[24]:

	index	Title	Release Year	Locations	Fun Facts	Production Company	Distributor	Director	Writer	Actor 1	Actor 2	Actor 3	SF Find Neighborhoods	Analysis Neighborhoods	Current Supervisor Districts
0	0	Experiment in Terror	1962	The Sea Captain's Chest (Fisherman's Wharf)	None	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	99.0	23.0	3.0
1	1	Experiment in Terror	1962	100 St. Germain Avenue	None	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	47.0	38.0	8.0
2	421	Experiment in Terror	1962	1 Montgomery Street at Post	None	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	19.0	8.0	3.0

Execute an SQL query using the 'sql' cell magic function to retrieve data from the 'FilmLocations' table in the SQLite database. Select records where the 'Release Year' is 1962, the 'Distributor' is 'Columbia Pictures', and the 'Fun Facts' column is NULL. Limit the result to the first 3 matching records.

In [25]:

%%sql
SELECT * FROM FilmLocations WHERE [Release Year]=1962 AND Distributor = 'Columbia Pictures' AND [Fun Facts] is NULL LIMIT 3

Out[25]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
0	0	Experiment in Terror	1962	The Sea Captain's Chest (Fisherman's Wharf)	None	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	99.0	23.0	3.0
1	1	Experiment in Terror	1962	100 St. Germain Avenue	None	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	47.0	38.0	8.0
2	421	Experiment in Terror	1962	1 Montgomery Street at Post	None	Columbia Pictures Corporation	Columbia Pictures	Blake Edwards	The Gordons	Glenn Ford	Lee Remick	Stefanie Powers	19.0	8.0	3.0

In [26]:

%%sql
--SELECT * FROM FilmLocations
--SELECT * FROM FilmLocations LIMIT 15 OFFSET 10;
--SELECT DISTINCT Title FROM FilmLocations LIMIT 50;
SELECT DISTINCT Title FROM FilmLocations WHERE [Release Year]=1915 LIMIT 10;
--SELECT DISTINCT Title FROM FilmLocations WHERE [ReleaseYear]=2015 LIMIT 3 OFFSET 5;

Out[26]:

	0
0	A Jitney Elopement

In [27]:

pd.read_sql_query("SELECT DISTINCT Title FROM FilmLocations WHERE [Release Year]=1915 

 10;", conn)

Out[27]:

	Title
0	A Jitney Elopement

In [27]:

Conclusion

Concluding our exploration of Python, Pandas, SQL, and Google Colab reveals a world of data-driven possibilities. Armed with SQL, you now possess the ability to query and extract insights from vast datasets. Pandas empowers you with the tools to manipulate, analyze, and visualize data with ease. Google Colab provides a collaborative and flexible environment for your data projects. Together, these tools form a formidable toolkit for data analysts and scientists. The code snippets and tutorials provided aim to expand your knowledge and ignite your passion for the endless opportunities that data analysis brings. As you continue to explore, experiment, and innovate with these tools, you’ll witness your data-driven endeavors flourish. Thank you for joining this enlightening journey, and remember, the world of data analysis is at your fingertips.