Pandas & Jupyter: The Data Science Workflow
python data-science pandas
If you’re doing data science in Python, you’re almost certainly using Pandas and Jupyter. They’re the standard tools for exploration, analysis, and prototyping.
Setting Up Your Environment
pip install jupyter pandas numpy matplotlib seaborn
jupyter notebook
Or use JupyterLab for a more modern interface:
pip install jupyterlab
jupyter lab
Pandas Fundamentals
DataFrames and Series
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
# Access columns (Series)
df['name'] # Returns Series
df[['name', 'age']] # Returns DataFrame
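To make the single- vs double-bracket distinction concrete, here is a small runnable check using the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# Single brackets return a 1-D Series; double brackets return a 2-D DataFrame
print(type(df['name']).__name__)           # Series
print(type(df[['name', 'age']]).__name__)  # DataFrame
```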
Reading Data
# CSV
df = pd.read_csv('data.csv')
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON
df = pd.read_json('data.json')
# SQL
from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost/mydb')
df = pd.read_sql('SELECT * FROM users', engine)
Basic Exploration
df.head() # First 5 rows
df.tail() # Last 5 rows
df.shape # (rows, columns)
df.info() # Column types, non-null counts
df.describe() # Statistical summary
df.columns # Column names
df.dtypes # Data types
Data Selection
By Label (loc)
df.loc[0] # First row
df.loc[0:2] # Rows 0-2 (inclusive)
df.loc[0, 'name'] # Specific cell
df.loc[:, 'name'] # All rows, one column
df.loc[df['age'] > 25] # Filter by condition
By Position (iloc)
df.iloc[0] # First row
df.iloc[0:2] # Rows 0-1 (exclusive end)
df.iloc[0, 0] # First cell
df.iloc[:, 0] # All rows, first column
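The inclusive/exclusive slicing difference between loc and iloc trips up almost everyone at least once. A quick sketch with the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# loc slices by label and INCLUDES the end label
print(len(df.loc[0:2]))   # 3 rows

# iloc slices by position and EXCLUDES the end position, like Python lists
print(len(df.iloc[0:2]))  # 2 rows
```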
Boolean Indexing
# Single condition
df[df['age'] > 25]
# Multiple conditions
df[(df['age'] > 25) & (df['city'] == 'NYC')]
# isin for multiple values
df[df['city'].isin(['NYC', 'LA'])]
# String methods
df[df['name'].str.contains('li')]
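Combining conditions is where syntax errors creep in: pandas needs parentheses around each condition and the bitwise `&`/`|` operators, not Python's `and`/`or`. A runnable example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# Each condition in parentheses, joined with & (not `and`)
hits = df[(df['age'] > 25) & (df['city'].isin(['LA', 'Chicago']))]
print(hits['name'].tolist())  # ['Bob', 'Charlie']
```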
Data Cleaning
Handling Missing Data
df.isnull().sum() # Count nulls per column
df.dropna() # Drop rows with any null
df.dropna(subset=['name']) # Drop if name is null
df.fillna(0) # Fill nulls with value
df['age'] = df['age'].fillna(df['age'].mean()) # Fill with mean (assign back; fillna returns a copy)
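Putting the missing-data tools together on a small frame with real NaNs:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, np.nan, 35],
})

print(df.isnull().sum())  # one null in each column

# Fill missing ages with the column mean: mean of 25 and 35 is 30.0
df['age'] = df['age'].fillna(df['age'].mean())
print(df['age'].tolist())  # [25.0, 30.0, 35.0]
```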
Type Conversion
df['age'] = df['age'].astype(int)
df['date'] = pd.to_datetime(df['date'])
df['city'] = df['city'].astype('category')
Renaming and Reordering
df = df.rename(columns={'old_name': 'new_name'}) # returns a copy; assign back
df.columns = ['a', 'b', 'c']
df = df[['c', 'a', 'b']] # Reorder columns
Data Transformation
Apply Functions
# Apply to column
df['age_doubled'] = df['age'].apply(lambda x: x * 2)
# Apply to row
df['full_info'] = df.apply(
    lambda row: f"{row['name']} from {row['city']}",
    axis=1
)
# Vectorized operations (faster)
df['age_doubled'] = df['age'] * 2
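Both forms produce the same column; the vectorized version just skips the per-row Python function call, which is why it's dramatically faster on large frames:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

via_apply = df['age'].apply(lambda x: x * 2)
vectorized = df['age'] * 2

# Identical results; the vectorized form runs in compiled code, not a Python loop
assert via_apply.equals(vectorized)
print(vectorized.tolist())  # [50, 60, 70]
```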
Group By
# Aggregate by group
df.groupby('city')['age'].mean()
df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})
# Multiple aggregations
df.groupby('city').agg({
    'age': ['mean', 'min', 'max'],
    'salary': 'sum'
})
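A self-contained groupby example. Note that passing a list of aggregations produces a MultiIndex on the result's columns, which you index with tuples:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'LA'],
    'age': [25, 35, 30],
    'salary': [100, 120, 90],
})

means = df.groupby('city')['age'].mean()
print(means['NYC'])  # 30.0

agg = df.groupby('city').agg({'age': ['mean', 'min', 'max'], 'salary': 'sum'})
# Multiple aggregations yield MultiIndex columns like ('age', 'mean')
print(agg.loc['NYC', ('salary', 'sum')])  # 220
```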
Merge and Join
# Merge (SQL-style join)
pd.merge(df1, df2, on='user_id')
pd.merge(df1, df2, on='user_id', how='left')
# Concat (stacking)
pd.concat([df1, df2]) # Vertical stack
pd.concat([df1, df2], axis=1) # Horizontal stack
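To see how `how='left'` differs from the default inner join, here is a small sketch (the `users`/`orders` frames are made up for illustration):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'user_id': [1, 1, 2], 'total': [10, 20, 5]})

# Inner join: Charlie has no orders, so he disappears
inner = pd.merge(users, orders, on='user_id')
print(len(inner))  # 3

# Left join: every user kept; Charlie's total becomes NaN
left = pd.merge(users, orders, on='user_id', how='left')
print(len(left))                     # 4
print(left['total'].isnull().sum())  # 1
```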
Visualization in Jupyter
import matplotlib.pyplot as plt
import seaborn as sns
# Inline plots
%matplotlib inline
# Pandas built-in
df['age'].plot(kind='hist')
df.plot(x='date', y='value', kind='line')
# Seaborn for prettier plots
sns.histplot(df['age'])
sns.scatterplot(data=df, x='age', y='salary', hue='city')
sns.boxplot(data=df, x='city', y='age')
Jupyter Best Practices
Magic Commands
%timeit df.groupby('city')['age'].mean() # Benchmark (select numeric columns before mean)
%who # List variables
%reset # Clear namespace
%%time # Time cell execution (cell magic; must be the first line of the cell)
Display Options
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
Organized Notebooks
# Structure your notebook:
# 1. Imports and Setup
# 2. Data Loading
# 3. Exploration
# 4. Cleaning
# 5. Analysis
# 6. Visualization
# 7. Conclusions
Performance Tips
Use Vectorized Operations
# Slow
for i, row in df.iterrows():
    df.at[i, 'new'] = row['a'] + row['b']
# Fast
df['new'] = df['a'] + df['b']
Memory Optimization
# Use appropriate dtypes
df['id'] = df['id'].astype('int32')
df['category'] = df['category'].astype('category')
# Check memory usage
df.info(memory_usage='deep')
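The `category` savings are easy to verify. This sketch builds a low-cardinality string column (just three repeated city names) and compares its footprint in both dtypes:

```python
import pandas as pd

# A low-cardinality string column: a good candidate for 'category'
s = pd.Series(['NYC', 'LA', 'Chicago'] * 10_000)

as_object = s.memory_usage(deep=True)
as_category = s.astype('category').memory_usage(deep=True)

# category stores each unique string once plus small integer codes per row
print(as_object, as_category)
assert as_category < as_object
```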
Chunked Reading
# For large files
chunks = pd.read_csv('huge.csv', chunksize=10000)
for chunk in chunks:
    process(chunk)
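The usual pattern is to reduce each chunk and combine the partial results, so the full file never sits in memory. A minimal sketch (an in-memory buffer stands in for the large file, and the column name `value` is made up):

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk: one 'value' column with 0..9
csv = io.StringIO('value\n' + '\n'.join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    # Reduce each chunk, then accumulate the partial sums
    total += chunk['value'].sum()

print(total)  # 45
```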
From Notebook to Production
Notebooks are for exploration. For production:
- Extract functions to .py modules
- Add tests for data transformations
- Use scripts for scheduled jobs
- Version control notebooks carefully (consider nbstripout)
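One way the extract-and-test step might look (the module, function, and test names here are hypothetical):

```python
import pandas as pd

# transforms.py -- logic moved out of the notebook so it can be imported anywhere
def clean_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing ages with the column mean; returns a new frame."""
    out = df.copy()
    out['age'] = out['age'].fillna(out['age'].mean())
    return out

# test_transforms.py -- a unit test for the transformation
def test_clean_ages():
    raw = pd.DataFrame({'age': [25.0, None, 35.0]})
    cleaned = clean_ages(raw)
    assert cleaned['age'].isnull().sum() == 0
    assert cleaned['age'].tolist() == [25.0, 30.0, 35.0]

test_clean_ages()
```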
Final Thoughts
Pandas and Jupyter are the gateway to data science. Master the basics—selection, filtering, grouping, merging—and you can analyze almost any dataset.
Start exploring your own data. The best way to learn is by doing.
Data is a story. Pandas helps you tell it.