Pandas & Jupyter: The Data Science Workflow
python data-science pandas
If you’re doing data science in Python, you’re almost certainly using Pandas and Jupyter. They’re the standard tools for exploration, analysis, and prototyping.
Setting Up Your Environment
pip install jupyter pandas numpy matplotlib seaborn
jupyter notebook
Or use JupyterLab for a more modern interface:
pip install jupyterlab
jupyter lab
Pandas Fundamentals
DataFrames and Series
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
# Access columns (Series)
df['name'] # Returns Series
df[['name', 'age']] # Returns DataFrame
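To make the single- vs double-bracket distinction concrete, here is a small runnable check using the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# Single brackets return a 1-D Series; double brackets return a 2-D DataFrame
print(type(df['name']).__name__)           # Series
print(type(df[['name', 'age']]).__name__)  # DataFrame
```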
Reading Data
# CSV
df = pd.read_csv('data.csv')
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON
df = pd.read_json('data.json')
# SQL
from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost/mydb')
df = pd.read_sql('SELECT * FROM users', engine)
Basic Exploration
df.head() # First 5 rows
df.tail() # Last 5 rows
df.shape # (rows, columns)
df.info() # Column types, non-null counts
df.describe() # Statistical summary
df.columns # Column names
df.dtypes # Data types
Data Selection
By Label (loc)
df.loc[0] # First row
df.loc[0:2] # Rows 0-2 (inclusive)
df.loc[0, 'name'] # Specific cell
df.loc[:, 'name'] # All rows, one column
df.loc[df['age'] > 25] # Filter by condition
By Position (iloc)
df.iloc[0] # First row
df.iloc[0:2] # Rows 0-1 (exclusive end)
df.iloc[0, 0] # First cell
df.iloc[:, 0] # All rows, first column
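The inclusive/exclusive slicing difference between loc and iloc trips up almost everyone at least once. A quick sketch with the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# loc slices by label and INCLUDES the end label
print(len(df.loc[0:2]))   # 3 rows

# iloc slices by position and EXCLUDES the end position, like Python lists
print(len(df.iloc[0:2]))  # 2 rows
```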
Boolean Indexing
# Single condition
df[df['age'] > 25]
# Multiple conditions
df[(df['age'] > 25) & (df['city'] == 'NYC')]
# isin for multiple values
df[df['city'].isin(['NYC', 'LA'])]
# String methods
df[df['name'].str.contains('li')]
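Combining conditions is where syntax errors creep in: pandas needs parentheses around each condition and the bitwise `&`/`|` operators, not Python's `and`/`or`. A runnable example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# Each condition in parentheses, joined with & (not `and`)
hits = df[(df['age'] > 25) & (df['city'].isin(['LA', 'Chicago']))]
print(hits['name'].tolist())  # ['Bob', 'Charlie']
```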
Data Cleaning
Handling Missing Data
df.isnull().sum() # Count nulls per column
df.dropna() # Drop rows with any null
df.dropna(subset=['name']) # Drop if name is null
df.fillna(0) # Fill nulls with value
df['age'] = df['age'].fillna(df['age'].mean()) # Fill with mean (assign back; fillna returns a copy)
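Putting the missing-data tools together on a small frame with real NaNs:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, np.nan, 35],
})

print(df.isnull().sum())  # one null in each column

# Fill missing ages with the column mean: mean of 25 and 35 is 30.0
df['age'] = df['age'].fillna(df['age'].mean())
print(df['age'].tolist())  # [25.0, 30.0, 35.0]
```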
Type Conversion
df['age'] = df['age'].astype(int)
df['date'] = pd.to_datetime(df['date'])
df['city'] = df['city'].astype('category')
Renaming and Reordering
df = df.rename(columns={'old_name': 'new_name'}) # returns a copy; assign back
df.columns = ['a', 'b', 'c']
df = df[['c', 'a', 'b']] # Reorder columns
Data Transformation
Apply Functions
# Apply to column
df['age_doubled'] = df['age'].apply(lambda x: x * 2)
# Apply to row
df['full_info'] = df.apply(
    lambda row: f"{row['name']} from {row['city']}",
    axis=1
)
# Vectorized operations (faster)
df['age_doubled'] = df['age'] * 2
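Both forms produce the same column; the vectorized version just skips the per-row Python function call, which is why it's dramatically faster on large frames:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

via_apply = df['age'].apply(lambda x: x * 2)
vectorized = df['age'] * 2

# Identical results; the vectorized form runs in compiled code, not a Python loop
assert via_apply.equals(vectorized)
print(vectorized.tolist())  # [50, 60, 70]
```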
Group By
# Aggregate by group
df.groupby('city')['age'].mean()
df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})
# Multiple aggregations
df.groupby('city').agg({
    'age': ['mean', 'min', 'max'],
    'salary': 'sum'
})
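A self-contained groupby example. Note that passing a list of aggregations produces a MultiIndex on the result's columns, which you index with tuples:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'LA'],
    'age': [25, 35, 30],
    'salary': [100, 120, 90],
})

means = df.groupby('city')['age'].mean()
print(means['NYC'])  # 30.0

agg = df.groupby('city').agg({'age': ['mean', 'min', 'max'], 'salary': 'sum'})
# Multiple aggregations yield MultiIndex columns like ('age', 'mean')
print(agg.loc['NYC', ('salary', 'sum')])  # 220
```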
Merge and Join
# Merge (SQL-style join)
pd.merge(df1, df2, on='user_id')
pd.merge(df1, df2, on='user_id', how='left')
# Concat (stacking)
pd.concat([df1, df2]) # Vertical stack
pd.concat([df1, df2], axis=1) # Horizontal stack
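To see how `how='left'` differs from the default inner join, here is a small sketch (the `users`/`orders` frames are made up for illustration):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'user_id': [1, 1, 2], 'total': [10, 20, 5]})

# Inner join: Charlie has no orders, so he disappears
inner = pd.merge(users, orders, on='user_id')
print(len(inner))  # 3

# Left join: every user kept; Charlie's total becomes NaN
left = pd.merge(users, orders, on='user_id', how='left')
print(len(left))                     # 4
print(left['total'].isnull().sum())  # 1
```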
Visualization in Jupyter
import matplotlib.pyplot as plt
import seaborn as sns
# Inline plots
%matplotlib inline
# Pandas built-in
df['age'].plot(kind='hist')
df.plot(x='date', y='value', kind='line')
# Seaborn for prettier plots
sns.histplot(df['age'])
sns.scatterplot(data=df, x='age', y='salary', hue='city')
sns.boxplot(data=df, x='city', y='age')
Jupyter Best Practices
Magic Commands
%timeit df.groupby('city')['age'].mean() # Benchmark (select numeric columns before mean)
%who # List variables
%reset # Clear namespace
%%time # Time cell execution (cell magic; must be the first line of the cell)
Display Options
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
Organized Notebooks
# Structure your notebook:
# 1. Imports and Setup
# 2. Data Loading
# 3. Exploration
# 4. Cleaning
# 5. Analysis
# 6. Visualization
# 7. Conclusions
Performance Tips
Use Vectorized Operations
# Slow
for i, row in df.iterrows():
    df.at[i, 'new'] = row['a'] + row['b']
# Fast
df['new'] = df['a'] + df['b']
Memory Optimization
# Use appropriate dtypes
df['id'] = df['id'].astype('int32')
df['category'] = df['category'].astype('category')
# Check memory usage
df.info(memory_usage='deep')
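The `category` savings are easy to verify. This sketch builds a low-cardinality string column (just three repeated city names) and compares its footprint in both dtypes:

```python
import pandas as pd

# A low-cardinality string column: a good candidate for 'category'
s = pd.Series(['NYC', 'LA', 'Chicago'] * 10_000)

as_object = s.memory_usage(deep=True)
as_category = s.astype('category').memory_usage(deep=True)

# category stores each unique string once plus small integer codes per row
print(as_object, as_category)
assert as_category < as_object
```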
Chunked Reading
# For large files
chunks = pd.read_csv('huge.csv', chunksize=10000)
for chunk in chunks:
    process(chunk)
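The usual pattern is to reduce each chunk and combine the partial results, so the full file never sits in memory. A minimal sketch (an in-memory buffer stands in for the large file, and the column name `value` is made up):

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk: one 'value' column with 0..9
csv = io.StringIO('value\n' + '\n'.join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    # Reduce each chunk, then accumulate the partial sums
    total += chunk['value'].sum()

print(total)  # 45
```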
From Notebook to Production
Notebooks are for exploration. For production:
- Extract functions to .py modules
- Add tests for data transformations
- Use scripts for scheduled jobs
- Version control notebooks carefully (consider nbstripout)
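One way the extract-and-test step might look (the module, function, and test names here are hypothetical):

```python
import pandas as pd

# transforms.py -- logic moved out of the notebook so it can be imported anywhere
def clean_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing ages with the column mean; returns a new frame."""
    out = df.copy()
    out['age'] = out['age'].fillna(out['age'].mean())
    return out

# test_transforms.py -- a unit test for the transformation
def test_clean_ages():
    raw = pd.DataFrame({'age': [25.0, None, 35.0]})
    cleaned = clean_ages(raw)
    assert cleaned['age'].isnull().sum() == 0
    assert cleaned['age'].tolist() == [25.0, 30.0, 35.0]

test_clean_ages()
```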
Final Thoughts
Pandas and Jupyter are the gateway to data science. Master the basics—selection, filtering, grouping, merging—and you can analyze almost any dataset.
Start exploring your own data. The best way to learn is by doing.
Data is a story. Pandas helps you tell it.