Python is useful for data science because it lets you move from raw files to analysis, charts, and simple models in one workflow. This guide follows that path with practical Python code: set up the core libraries, inspect and clean a dataset, explore patterns, train a first model, and turn the output into useful insights.
Set up a Python data science stack that matches the job
A basic Python data science setup does not need many tools. Start with a small set of libraries that covers the usual workflow: loading data, transforming it, visualizing it, and building a model.
| Tool | Main job | Use it when you need to |
|---|---|---|
| pandas | Work with tables of data | Read CSV or newer Excel files, clean columns, group rows, calculate summaries |
| NumPy | Handle arrays and numerical operations | Do fast math, create numerical features, support pandas and scikit-learn workflows |
| matplotlib | Build charts | Make line charts, bar charts, histograms, and other basic plots |
| seaborn | Build cleaner statistical charts | Compare categories, show distributions, and make quick exploratory visuals |
| scikit-learn | Train and evaluate machine learning models | Split data, preprocess features, fit models, make predictions, measure results |
| JupyterLab or notebooks | Run code interactively | Explore data step by step and see tables, charts, and notes in one place |
A notebook is often the easiest place to begin because data science work is exploratory. You run a few lines, inspect the result, adjust, and keep moving. A script is more useful once the workflow is stable and you want to rerun it the same way every time.
A simple setup path looks like this:
python -m venv .venv
On macOS or Linux:
source .venv/bin/activate
On Windows PowerShell:
.venv\Scripts\Activate.ps1
Then install the common packages:
pip install pandas numpy matplotlib seaborn scikit-learn jupyterlab openpyxl
Start JupyterLab with:
jupyter lab
The openpyxl package helps pandas read newer Excel workbook formats such as .xlsx and .xlsm. Other Excel formats may require different engines or a conversion first.
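After installing, it can help to confirm that each package is visible to the active environment before opening a notebook. The sketch below uses only the standard library. The helper name `installed_versions` is made up for this example, and the package list simply mirrors the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used in the pip install command above.
PACKAGES = [
    "pandas", "numpy", "matplotlib", "seaborn",
    "scikit-learn", "jupyterlab", "openpyxl",
]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:15} {ver or 'NOT INSTALLED'}")
```

If a package shows as not installed, the usual cause is that the virtual environment was not activated in the terminal where pip ran.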
Real story
I once loaded a dataset in Python and spent 20 minutes bragging about my “analysis” while staring at a perfectly blank plot. Then I noticed I had accidentally filtered the data down to one row: a typo, a coffee stain, and a very confident-looking NaN. My model was technically trained, but it had all the predictive power of a toaster.
Load a dataset and inspect the parts that matter first
Before cleaning or modeling anything, load the data and check what Python actually received. That catches many problems early, including missing columns, dates stored as text, empty values, or numbers imported as strings.
Assume you have a file named sales.csv with columns such as:
order_id, order_date, customer_id, region, channel, product_category, order_value, items, returned
If you do not have a file yet, you can use this tiny sample dataset to run the examples. It is only for practice, not for meaningful business conclusions.
from io import StringIO
import pandas as pd
sample_csv = StringIO("""
order_id,order_date,customer_id,region,channel,product_category,order_value,items,returned
1001,2024-01-05,C001,North,Online,Apparel,79.99,2,no
1002,2024-01-08,C002,South,Store,Home,149.50,3,yes
1003,2024-02-03,C003,West,Online,Electronics,399.00,1,no
1004,2024-02-14,C004,East,Marketplace,Apparel,58.25,1,yes
1005,2024-03-02,C005,North,Store,Beauty,34.75,2,no
1006,2024-03-11,C006,South,Online,Home,220.00,4,no
1007,2024-04-01,C007,West,Marketplace,Electronics,520.00,2,yes
1008,2024-04-18,C008,East,Online,Beauty,45.00,1,no
""")
df = pd.read_csv(sample_csv)
Here is a practical first pass when you are working with your own file.
- Import pandas and load the file.
import pandas as pd
df = pd.read_csv("sales.csv")
For Excel files, use:
df = pd.read_excel("sales.xlsx", sheet_name="Orders")
- Preview the first few rows.
df.head()
This shows whether the columns look right and whether the values match your expectations. For example, order_value should appear as a number, not a mix of currency symbols and text.
- Check the shape of the dataset.
df.shape
The result is:
(rows, columns)
If you expected thousands of rows and see 12, something likely went wrong during export, filtering, or loading.
- Inspect column names and data types.
df.columns
df.info()
df.info() is one of the most useful early checks. It shows each column, how many non-missing values it has, and its data type.
- Count missing values.
df.isna().sum().sort_values(ascending=False)
This shows where the gaps are. A few missing values in region may be easy to handle. Many missing values in the target column for a model may change the whole plan.
- Look at quick numerical summaries.
df.describe()
This helps catch odd values. If items has negative numbers, or order_value has a maximum that is wildly larger than everything else, pause before building charts or models.
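One problem mentioned earlier, numbers imported as strings, is easy to miss by eye. The sketch below is one possible check, not a standard pandas feature: the helper `numeric_looking_text_columns` and its `threshold` parameter are hypothetical, and the demo frame is made up. It reports text columns where most non-missing values would parse as numbers.

```python
import pandas as pd


def numeric_looking_text_columns(frame, threshold=0.5):
    """Return {column: share of non-missing values that parse as numbers}
    for text columns where that share is at least `threshold`."""
    suspects = {}
    for col in frame.columns:
        is_text = (
            pd.api.types.is_object_dtype(frame[col])
            or pd.api.types.is_string_dtype(frame[col])
        )
        if not is_text:
            continue
        values = frame[col].dropna()
        if values.empty:
            continue
        # Values that cannot be parsed become NaN instead of raising.
        parsed = pd.to_numeric(values, errors="coerce")
        share = float(parsed.notna().mean())
        if share >= threshold:
            suspects[col] = share
    return suspects


# Made-up frame: order_value arrived as text because of one currency symbol.
demo = pd.DataFrame({
    "region": ["North", "South", "West", "East", "North"],
    "order_value": ["79.99", "149.50", "$399.00", "58.25", "34.75"],
    "items": [2, 3, 1, 1, 2],
})

suspects = numeric_looking_text_columns(demo)
print(suspects)
```

A column that shows up here is a candidate for conversion during cleaning. The values that fail to parse are the ones worth looking at first.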
Clean, reshape, and validate the data before any analysis
Cleaning is not busywork. It decides whether your charts and models are built on something trustworthy. The aim is not perfect data; it is data that is clear enough for the question you are trying to answer.
Here is a practical cleanup sequence.
- Standardize column names.
Column names are easier to work with when they are lowercase, consistent, and free of spaces.
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)
- Remove exact duplicate rows.
df = df.drop_duplicates()
If duplicates are meaningful in your system, investigate before removing them. For example, two rows with the same customer may be normal. Two rows with the same order_id may not be.
- Convert dates and numeric fields.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["order_value"] = pd.to_numeric(df["order_value"], errors="coerce")
df["items"] = pd.to_numeric(df["items"], errors="coerce")
The errors="coerce" option turns invalid values into missing values. That makes problems visible instead of hiding them inside text.
- Normalize the returned target if you plan to analyze or model returns.
The later return-rate and classification examples assume returned is encoded as 1 for returned and 0 for not returned. If the raw data uses labels such as yes and no, normalize and validate the column before splitting data or calculating return rates.
return_label_map = {
    "yes": 1, "y": 1, "true": 1, "t": 1, "1": 1, "1.0": 1,
    "no": 0, "n": 0, "false": 0, "f": 0, "0": 0, "0.0": 0
}
returned_clean = df["returned"].astype("string").str.strip().str.lower()
unknown_returned = returned_clean[
    returned_clean.notna() & ~returned_clean.isin(return_label_map.keys())
].unique()
assert len(unknown_returned) == 0, f"Unexpected returned labels: {unknown_returned}"
df["returned"] = returned_clean.map(return_label_map)
df = df.dropna(subset=["returned"])
df["returned"] = df["returned"].astype(int)
assert df["returned"].isin([0, 1]).all()
If your dataset uses different labels, add them to the mapping deliberately rather than letting pandas guess.
- Handle missing values based on the meaning of each column.
df["region"] = df["region"].fillna("Unknown")
df["channel"] = df["channel"].fillna("Unknown")
df["product_category"] = df["product_category"].fillna("Unknown")
For date and numeric columns, choose a strategy that fits the analysis. If order_date, order_value, and items are all critical, you might remove rows missing any of them:
df = df.dropna(subset=["order_date", "order_value", "items"])
Or, if items can reasonably be imputed but order_date and order_value cannot, drop rows missing the critical fields and fill items with its median:
df = df.dropna(subset=["order_date", "order_value"])
df["items"] = df["items"].fillna(df["items"].median())
Do not fill missing values automatically just because pandas allows it. Ask what the missing value means.
- Filter values that do not make sense for the analysis.
df = df[df["order_value"] >= 0]
df = df[df["items"] > 0]
Negative order values may be refunds in some systems. If so, keep them and label them properly. If they are data errors, remove or correct them.
- Create useful analysis fields.
df["order_month"] = df["order_date"].dt.to_period("M").astype(str)
df["revenue_per_item"] = df["order_value"] / df["items"]
New columns like these make later analysis easier. They also make charts easier to read.
- Validate after transformations.
print(df.shape)
print(df[["order_value", "items", "revenue_per_item"]].describe())
print(df.isna().sum().sort_values(ascending=False).head(10))
A few simple checks can prevent quiet mistakes. For example, if revenue_per_item contains infinity, you probably still have rows where items is zero.
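To make two of the points above concrete, here is a small sketch on made-up values. It shows how errors="coerce" turns unparseable entries into missing values, and how a zero in items produces infinity in revenue_per_item instead of an error. The frame names `raw` and `clean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up raw values: one unparseable date, one text amount, one zero item count.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "not a date", "2024-02-03"],
    "order_value": ["79.99", "n/a", "399.00"],
    "items": ["2", "3", "0"],
})

clean = pd.DataFrame({
    "order_date": pd.to_datetime(raw["order_date"], errors="coerce"),
    "order_value": pd.to_numeric(raw["order_value"], errors="coerce"),
    "items": pd.to_numeric(raw["items"], errors="coerce"),
})

# Invalid entries are now missing values (NaT / NaN) that isna() can count.
print(clean.isna().sum())

# Dividing by a zero item count gives infinity rather than raising an error.
clean["revenue_per_item"] = clean["order_value"] / clean["items"]
infinite_rows = clean[np.isinf(clean["revenue_per_item"])]
print(infinite_rows)
```

Running the filter on items before creating revenue_per_item, as in the sequence above, avoids the infinite values in the first place.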
You can also use assertions for rules that must always be true:
assert (df["order_value"] >= 0).all()
assert (df["items"] > 0).all()
assert df["order_date"].notna().all()
assert df["returned"].isin([0, 1]).all()
Assertions are direct. If a rule breaks, Python stops and tells you. It is less polite than a chart, but often more useful.
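Assertions stop at the first broken rule. When you would rather see every problem at once, one alternative is to collect the failures and report them together. The sketch below assumes the same four rules as the assertions above; the function name `find_rule_violations` and the sample rows are made up for illustration.

```python
import pandas as pd


def find_rule_violations(frame):
    """Return a list of human-readable messages, one per broken rule."""
    rules = {
        "order_value must be >= 0": frame["order_value"] >= 0,
        "items must be > 0": frame["items"] > 0,
        "order_date must not be missing": frame["order_date"].notna(),
        "returned must be 0 or 1": frame["returned"].isin([0, 1]),
    }
    problems = []
    for description, passed in rules.items():
        failing = int((~passed).sum())
        if failing:
            problems.append(f"{description}: {failing} row(s) fail")
    return problems


# Made-up rows with two deliberate problems: a negative value and a zero item count.
orders = pd.DataFrame({
    "order_value": [79.99, -5.00, 220.00],
    "items": [2, 1, 0],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-08", "2024-03-11"]),
    "returned": [0, 1, 0],
})

problems = find_rule_violations(orders)
for message in problems:
    print(message)
```

Note that a missing order_value or items also counts as a failure here, because a comparison against a missing value evaluates to False.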
Explore the data with pandas summaries and visualizations
Once the data is clean enough, start with summaries before moving into models. Summaries show the broad shape of the data. Charts help you see patterns, outliers, and relationships that are hard to spot in rows.
Example: summarize sales by category
category_summary = (
    df.groupby("product_category")
    .agg(
        orders=("order_id", "count"),
        total_sales=("order_value", "sum"),
        average_order_value=("order_value", "mean"),
        average_items=("items", "mean")
    )
    .sort_values("total_sales", ascending=False)
)
category_summary
This produces a compact table showing which categories bring in the most sales, how many orders they have, and whether the average order value differs by category.
If one category has high total sales but a low average order value, it may be driven by volume. If another has fewer orders but a high average value, it may deserve different attention.
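One way to see whether a category is driven by volume or by order size is to compare its share of orders with its share of sales. This is a sketch on made-up numbers; the column names `order_share` and `sales_share` are illustrative and not part of the summary above.

```python
import pandas as pd

# Made-up orders, only to show the calculation.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "product_category": ["Beauty", "Beauty", "Beauty", "Beauty", "Electronics", "Electronics"],
    "order_value": [30.0, 40.0, 35.0, 45.0, 400.0, 450.0],
})

summary = orders.groupby("product_category").agg(
    orders=("order_id", "count"),
    total_sales=("order_value", "sum"),
)

# Share of all orders and share of all sales, per category.
summary["order_share"] = summary["orders"] / summary["orders"].sum()
summary["sales_share"] = summary["total_sales"] / summary["total_sales"].sum()

print(summary.round(2))
```

In this toy data, Beauty accounts for most of the orders but a small share of sales, while Electronics is the reverse. A large gap between the two shares is the signal to look for.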
Example: plot sales by category
import matplotlib.pyplot as plt
import seaborn as sns
top_categories = category_summary.head(10).reset_index()
plt.figure(figsize=(10, 5))
sns.barplot(
    data=top_categories,
    x="total_sales",
    y="product_category"
)
plt.title("Top Product Categories by Total Sales")
plt.xlabel("Total Sales")
plt.ylabel("Product Category")
plt.tight_layout()
plt.show()
A bar chart is useful when you want to compare groups. Keep it simple. The goal is to see the pattern, not to prove that your chart can wear formal clothes.
Example: look at sales over time
monthly_sales = (
    df.groupby("order_month")
    .agg(total_sales=("order_value", "sum"))
    .reset_index()
)
plt.figure(figsize=(12, 5))
sns.lineplot(
    data=monthly_sales,
    x="order_month",
    y="total_sales",
    marker="o"
)
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Total Sales")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
A line chart helps you spot changes over time. Look for steady growth, sudden drops, seasonal movement, or unusual spikes.
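Sudden drops and spikes are easier to point at when each month also carries its change from the month before. A minimal sketch, assuming a frame shaped like monthly_sales with made-up totals; the column name `change_vs_prior_month` is illustrative.

```python
import pandas as pd

# Made-up monthly totals in the same shape as monthly_sales above.
monthly = pd.DataFrame({
    "order_month": ["2024-01", "2024-02", "2024-03", "2024-04"],
    "total_sales": [1000.0, 1100.0, 880.0, 968.0],
})

# Sort first: pct_change compares each row with the row before it.
# "YYYY-MM" strings sort in chronological order.
monthly = monthly.sort_values("order_month").reset_index(drop=True)
monthly["change_vs_prior_month"] = monthly["total_sales"].pct_change()

print(monthly)
```

The first month has no prior month, so its change is missing. If a month is absent from the data entirely, the comparison silently spans the gap, so check for missing months before reading too much into a large change.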
Example: inspect the distribution of order values
plt.figure(figsize=(8, 5))
sns.histplot(df["order_value"], bins=30)
plt.title("Distribution of Order Values")
plt.xlabel("Order Value")
plt.ylabel("Number of Orders")
plt.tight_layout()
plt.show()
A histogram shows whether most orders are small, whether a few very large orders exist, and whether the distribution is skewed. This matters for both analysis and modeling.
Outliers are not always errors. A large order may be a real enterprise purchase. The important part is noticing it and deciding how to handle it.
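One common heuristic for noticing unusually large or small values is the interquartile range (IQR) rule. The sketch below is an assumption about how you might flag candidates for review, not a rule for deleting rows; the function name `flag_iqr_outliers`, the multiplier `k=1.5`, and the sample values are illustrative.

```python
import pandas as pd


def flag_iqr_outliers(values, k=1.5):
    """Return a boolean Series: True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up order values with one very large order.
order_values = pd.Series([34.75, 45.0, 58.25, 79.99, 149.5, 220.0, 399.0, 5200.0])
is_outlier = flag_iqr_outliers(order_values)

print(order_values[is_outlier])
```

Flagged rows are a list of things to look at. Whether a flagged order is an error or a real enterprise purchase is still a judgment call.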
Build a first machine learning model with scikit-learn
A first model should be simple. The point is to build a clean training flow, not to chase a complex algorithm. Start with a baseline, then try a straightforward model and evaluate it on data the model did not see during training.
In this example, assume the dataset includes a normalized returned column that marks whether an order was returned: 1 means returned and 0 means not returned. The goal is to predict that value from order details.
- Choose the target and features.
target = "returned"
features = [
    "region",
    "channel",
    "product_category",
    "order_value",
    "items",
    "revenue_per_item"
]
model_df = df[features + [target]].dropna()
assert model_df[target].isin([0, 1]).all()
Avoid features that would not be known at prediction time. For example, do not use a refund date to predict whether an order will be returned. That would be data leakage: the model would be learning from the answer key.
- Split the data into training and test sets.
from sklearn.model_selection import train_test_split
X = model_df[features]
y = model_df[target]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
The training set is for fitting the model. The test set is for evaluation. Keep them separate.
The stratify=y option helps preserve the target balance in both sets. This is useful when one class is much more common than the other.
- Build a baseline model.
A baseline gives you something simple to beat. For classification, a dummy model can predict the most frequent class.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_predictions = baseline.predict(X_test)
print("Baseline accuracy:", accuracy_score(y_test, baseline_predictions))
# zero_division=0 avoids a warning for the class the baseline never predicts
print(classification_report(y_test, baseline_predictions, zero_division=0))
If your real model only barely beats the baseline, it may not be useful yet.
- Create preprocessing for numeric and categorical columns.
Scikit-learn models need numbers. Categorical columns such as region and channel must be encoded.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
numeric_features = ["order_value", "items", "revenue_per_item"]
categorical_features = ["region", "channel", "product_category"]
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)
A pipeline keeps preprocessing and modeling together. That reduces mistakes and makes the workflow easier to rerun.
- Fit a simple model.
Logistic regression is a reasonable first classification model. It is not always the best model, but it is often a good starting point.
from sklearn.linear_model import LogisticRegression
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LogisticRegression(max_iter=1000))
    ]
)
clf.fit(X_train, y_train)
- Predict and evaluate on the test set.
predictions = clf.predict(X_test)
print("Model accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
Accuracy is easy to understand, but it can be misleading when classes are imbalanced. If only a small share of orders are returned, a model can look accurate by predicting “not returned” most of the time.
Read precision, recall, and F1-score too:
  - Precision asks: when the model predicts a return, how often is it right?
  - Recall asks: of the actual returns, how many did the model catch?
  - F1-score balances precision and recall.
If returns are rare, the default decision threshold may not match your goal. As a next step, compare predicted probabilities at different thresholds, or try LogisticRegression(max_iter=1000, class_weight="balanced"). Tune these choices with a validation approach rather than repeatedly adjusting to the final test set.
- Inspect model coefficients carefully.
With logistic regression, coefficients can show which encoded features are associated with higher or lower predicted return risk. They do not prove cause and effect, and they can be affected by preprocessing, correlated features, and small sample sizes.
feature_names = clf.named_steps["preprocessor"].get_feature_names_out()
coefficients = (
    pd.Series(
        clf.named_steps["model"].coef_[0],
        index=feature_names
    )
    .sort_values(key=lambda values: values.abs(), ascending=False)
)
coefficients.head(10)
Positive coefficients push the model toward predicting returned = 1; negative coefficients push it toward returned = 0. Use this as a diagnostic clue, not as a final explanation.
- Check for common beginner mistakes.
  - Do not evaluate only on the training set. That tells you how well the model memorized familiar data.
  - Do not include columns that reveal the answer after the fact.
  - Do not use complex models before building a baseline.
  - Do not treat one metric as the whole story.
  - Do not ignore whether the prediction would be available early enough to be useful.
A clean modeling workflow is more valuable than a fancy model with unclear inputs.
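The suggestion to compare predicted probabilities at different thresholds, using a validation approach instead of the final test set, can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic (generated with make_classification, roughly 10% positives) and stands in for real orders, and the threshold values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 10% positives) standing in for real orders.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# Hold out a final test set first, then carve a validation set from the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Probability of the positive class (column 1) for each validation row.
val_probabilities = model.predict_proba(X_val)[:, 1]

threshold_results = []
for threshold in [0.2, 0.3, 0.5, 0.7]:
    predicted = (val_probabilities >= threshold).astype(int)
    threshold_results.append({
        "threshold": threshold,
        "precision": precision_score(y_val, predicted, zero_division=0),
        "recall": recall_score(y_val, predicted, zero_division=0),
    })

for row in threshold_results:
    print(row)
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision. The test set is left untouched here so it can still give an honest final estimate once a threshold has been chosen.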
Turn model results and charts into a clear insight
The final step is to connect the Python output back to a practical question. A chart, a grouped table, and a model metric are not insights by themselves. They become useful when they explain what is happening and suggest what to do next.
For example, suppose your exploration shows that one product category has a higher return rate than others, and the coefficient check shows that product_category, order_value, or channel is strongly associated with the model’s predictions. You might write a short conclusion like this:
Return risk appears higher for certain product categories, especially when order values are larger and orders come through specific channels. The first model performs better than the baseline, but recall is still limited, so it should not be used as an automated decision tool yet. A practical next step is to review product descriptions, sizing or fit information, and fulfillment notes for the highest-return categories, then retrain the model after those fixes are tested.
That conclusion is more useful than saying “the model accuracy is 0.78” and walking away. Metrics need context.
You can also create a small table to support the finding:
return_summary = (
    df.groupby("product_category")
    .agg(
        orders=("order_id", "count"),
        return_rate=("returned", "mean"),
        average_order_value=("order_value", "mean")
    )
    .sort_values("return_rate", ascending=False)
)
return_summary.head(10)
Because returned was normalized to 1 and 0 during cleanup, the mean of that column is the return rate for each group.
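A tiny made-up example shows why this works: the mean of a 0/1 column is the number of ones divided by the number of rows, which is exactly the share of returned orders.

```python
import pandas as pd

# Made-up orders: Apparel has 2 returns out of 4, Home has 1 out of 4.
orders = pd.DataFrame({
    "product_category": ["Apparel"] * 4 + ["Home"] * 4,
    "returned": [1, 0, 1, 0, 0, 0, 1, 0],
})

by_category = orders.groupby("product_category")["returned"]

rate_from_mean = by_category.mean()
rate_from_counts = by_category.sum() / by_category.count()

print(rate_from_mean)  # Apparel 0.50, Home 0.25
```

Both calculations give the same rates. Keep the order count next to the rate when you report it, since a rate based on a handful of orders can swing widely.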
Then pair it with a chart:
top_return_categories = return_summary.head(10).reset_index()
plt.figure(figsize=(10, 5))
sns.barplot(
    data=top_return_categories,
    x="return_rate",
    y="product_category"
)
plt.title("Highest Return Rates by Product Category")
plt.xlabel("Return Rate")
plt.ylabel("Product Category")
plt.tight_layout()
plt.show()
A good Python data science workflow moves in a steady order: set up the tools, inspect the data, clean it, explore it, model it, and explain what the result means. Each step protects the next one. If the cleaning is careless, the charts mislead you. If the evaluation is weak, the model looks better than it is. If the conclusion is vague, the work stays trapped in the notebook.
Keep the first version simple. A clear pandas summary, one useful chart, and a baseline scikit-learn model can teach you more than a complicated workflow that nobody can explain.
