Data Science in Python: Analyze Data and Build Models

Python is useful for data science because it lets you move from raw files to analysis, charts, and simple models in one workflow. This guide follows that path with practical Python code: set up the core libraries, inspect and clean a dataset, explore patterns, train a first model, and turn the output into useful insights.

Set up a Python data science stack that matches the job

A basic Python data science setup does not need many tools. Start with a small set of libraries that covers the usual workflow: loading data, transforming it, visualizing it, and building a model.

Tool	Main job	Use it when you need to
pandas	Work with tables of data	Read CSV or newer Excel files, clean columns, group rows, calculate summaries
NumPy	Handle arrays and numerical operations	Do fast math, create numerical features, support pandas and scikit-learn workflows
matplotlib	Build charts	Make line charts, bar charts, histograms, and other basic plots
seaborn	Build cleaner statistical charts	Compare categories, show distributions, and make quick exploratory visuals
scikit-learn	Train and evaluate machine learning models	Split data, preprocess features, fit models, make predictions, measure results
JupyterLab or notebooks	Run code interactively	Explore data step by step and see tables, charts, and notes in one place

A notebook is often the easiest place to begin because data science work is exploratory. You run a few lines, inspect the result, adjust, and keep moving. A script is more useful once the workflow is stable and you want to rerun it the same way every time.

A simple setup path looks like this:

python -m venv .venv

On macOS or Linux:

source .venv/bin/activate

On Windows PowerShell:

.venv\Scripts\Activate.ps1

Then install the common packages:

pip install pandas numpy matplotlib seaborn scikit-learn jupyterlab openpyxl

Start JupyterLab with:

jupyter lab

The openpyxl package lets pandas read the newer Excel workbook formats, .xlsx and .xlsm. Other Excel formats need a different engine (for example, xlrd for legacy .xls files or pyxlsb for binary .xlsb files), or a conversion to .xlsx first.
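
Before opening a notebook, it can help to confirm that the install actually landed in the active environment. The following is a minimal sketch using only the standard library; the package list simply mirrors the pip command above, and the helper name installed_version is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip install command.
packages = [
    "pandas", "numpy", "matplotlib", "seaborn",
    "scikit-learn", "jupyterlab", "openpyxl",
]


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


versions = {name: installed_version(name) for name in packages}

for name, found in versions.items():
    print(f"{name:<14} {found or 'NOT INSTALLED'}")
```

If a package shows as not installed, the virtual environment is probably not activated in the shell where pip ran.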

Real story

I once loaded a dataset in Python and spent 20 minutes bragging about my “analysis” while staring at a perfectly blank plot. Then I noticed I had accidentally filtered the data down to one row: a typo, a coffee stain, and a very confident-looking NaN. My model was technically trained, but it had all the predictive power of a toaster.


Load a dataset and inspect the parts that matter first

Before cleaning or modeling anything, load the data and check what Python actually received. That catches many problems early, including missing columns, dates stored as text, empty values, or numbers imported as strings.

Assume you have a file named sales.csv with columns such as:

order_id
order_date
customer_id
region
channel
product_category
order_value
items
returned

If you do not have a file yet, you can use this tiny sample dataset to run the examples. It is only for practice, not for meaningful business conclusions.

from io import StringIO
import pandas as pd

sample_csv = StringIO("""
order_id,order_date,customer_id,region,channel,product_category,order_value,items,returned
1001,2024-01-05,C001,North,Online,Apparel,79.99,2,no
1002,2024-01-08,C002,South,Store,Home,149.50,3,yes
1003,2024-02-03,C003,West,Online,Electronics,399.00,1,no
1004,2024-02-14,C004,East,Marketplace,Apparel,58.25,1,yes
1005,2024-03-02,C005,North,Store,Beauty,34.75,2,no
1006,2024-03-11,C006,South,Online,Home,220.00,4,no
1007,2024-04-01,C007,West,Marketplace,Electronics,520.00,2,yes
1008,2024-04-18,C008,East,Online,Beauty,45.00,1,no
""")

df = pd.read_csv(sample_csv)

Here is a practical first pass when you are working with your own file.

Import pandas and load the file.

import pandas as pd

df = pd.read_csv("sales.csv")

For Excel files, use:

df = pd.read_excel("sales.xlsx", sheet_name="Orders")

Preview the first few rows.
```
df.head()
```
This shows whether the columns look right and whether the values match your expectations. For example, order_value should appear as a number, not a mix of currency symbols and text.

Check the shape of the dataset.
```
df.shape
```
The result is:
```
(rows, columns)
```
If you expected thousands of rows and see 12, something likely went wrong during export, filtering, or loading.

Inspect column names and data types.
```
print(df.columns.tolist())
df.info()
```
df.info() is one of the most useful early checks. It shows each column, how many non-missing values it has, and its data type.

Count missing values.
```
df.isna().sum().sort_values(ascending=False)
```
This shows where the gaps are. A few missing values in region may be easy to handle. Many missing values in the target column for a model may change the whole plan.

Look at quick numerical summaries.
```
df.describe()
```
This helps catch odd values. If items has negative numbers, or order_value has a maximum that is wildly larger than everything else, pause before building charts or models.
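
The two warning signs mentioned above can also be checked directly instead of by eye. This is a small sketch on made-up rows (check_df and its values are hypothetical, not taken from sales.csv); the max-to-median ratio is just one rough heuristic for "wildly larger", not a standard rule.

```python
import pandas as pd

# Hypothetical rows with one negative item count and one extreme order value.
check_df = pd.DataFrame({
    "order_value": [79.99, 149.50, 58.25, 34.75, 12000.00],
    "items": [2, 3, -1, 2, 1],
})

numeric_cols = check_df.select_dtypes(include="number")

# Count negative values in each numeric column.
negative_counts = (numeric_cols < 0).sum()

# Ratio of maximum to median: a very large ratio hints at an extreme value.
max_to_median = numeric_cols.max() / numeric_cols.median()

print(negative_counts)
print(max_to_median.round(1))
```

Neither check says what to do with the rows it flags. It only tells you where to look before charting or modeling.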

Clean, reshape, and validate the data before any analysis

Cleaning is not busywork. It decides whether your charts and models are built on something trustworthy. The aim is not perfect data; it is data that is clear enough for the question you are trying to answer.

Here is a practical cleanup sequence.

Standardize column names.

Column names are easier to work with when they are lowercase, consistent, and free of spaces.
```
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)
```
Remove exact duplicate rows.
```
df = df.drop_duplicates()
```
If duplicates are meaningful in your system, investigate before removing them. For example, two rows with the same customer may be normal. Two rows with the same order_id may not be.
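
Note that drop_duplicates() only removes rows that match in every column. To find repeated order_id values that differ elsewhere, a check along these lines may help. The orders frame here is hypothetical sample data.

```python
import pandas as pd

# Hypothetical data: order 1002 appears twice with different values.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "customer_id": ["C001", "C002", "C002", "C001"],
    "order_value": [79.99, 149.50, 155.00, 20.00],
})

# keep=False marks every row in a duplicated group, not only the later copies.
repeated_ids = orders[orders.duplicated(subset=["order_id"], keep=False)]

print(repeated_ids)
```

Customer C001 appears twice without being flagged, which is the intended behavior: repeat customers are normal, repeat order IDs deserve a closer look.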

Convert dates and numeric fields.

df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["order_value"] = pd.to_numeric(df["order_value"], errors="coerce")
df["items"] = pd.to_numeric(df["items"], errors="coerce")

The errors="coerce" option turns values that cannot be parsed into missing values (NaN for numbers, NaT for dates). That makes problems visible instead of hiding them inside text.
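
Coercion is most useful when you also look at which raw values failed to convert. A minimal sketch, using a made-up Series rather than the real order_value column:

```python
import pandas as pd

# Hypothetical raw column: two clean numbers, one with a currency symbol,
# one placeholder string, and one value that was already missing.
raw_values = pd.Series(["79.99", "149.50", "$58.25", "n/a", None])

converted = pd.to_numeric(raw_values, errors="coerce")

# Values that were present in the raw data but could not be parsed.
failed = raw_values[converted.isna() & raw_values.notna()]

print(converted)
print(failed.tolist())
```

Listing the failed values shows whether the fix is a simple cleanup step, such as stripping a currency symbol, or a sign of a deeper export problem.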

Normalize the returned target if you plan to analyze or model returns.

The later return-rate and classification examples assume returned is encoded as 1 for returned and 0 for not returned. If the raw data uses labels such as yes and no, normalize and validate the column before splitting data or calculating return rates.

return_label_map = {
    "yes": 1,
    "y": 1,
    "true": 1,
    "t": 1,
    "1": 1,
    "1.0": 1,
    "no": 0,
    "n": 0,
    "false": 0,
    "f": 0,
    "0": 0,
    "0.0": 0
}

returned_clean = df["returned"].astype("string").str.strip().str.lower()

unknown_returned = returned_clean[
    returned_clean.notna() & ~returned_clean.isin(return_label_map.keys())
].unique()

assert len(unknown_returned) == 0, f"Unexpected returned labels: {unknown_returned}"

df["returned"] = returned_clean.map(return_label_map)
df = df.dropna(subset=["returned"])
df["returned"] = df["returned"].astype(int)

assert df["returned"].isin([0, 1]).all()

If your dataset uses different labels, add them to the mapping deliberately rather than letting pandas guess.

Handle missing values based on the meaning of each column.
```
df["region"] = df["region"].fillna("Unknown")
df["channel"] = df["channel"].fillna("Unknown")
df["product_category"] = df["product_category"].fillna("Unknown")
```
For numeric columns, choose a strategy that fits the analysis. If order_date, order_value, and items are all critical, you might remove rows missing any of them:
```
df = df.dropna(subset=["order_date", "order_value", "items"])
```
Or, if items can reasonably be imputed but order_date and order_value cannot, drop rows missing the critical fields and fill items with its median:
```
df = df.dropna(subset=["order_date", "order_value"])
df["items"] = df["items"].fillna(df["items"].median())
```
Do not fill missing values automatically just because pandas allows it. Ask what the missing value means.

Filter values that do not make sense for the analysis.
```
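df = df[df["order_value"] >= 0]
# .copy() gives an independent DataFrame, so adding columns later
# does not trigger a SettingWithCopyWarning
df = df[df["items"] > 0].copy()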
```
Negative order values may be refunds in some systems. If so, keep them and label them properly. If they are data errors, remove or correct them.
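
If negative values do turn out to be refunds, one way to keep and label them might look like the sketch below. The is_refund column name and the rule "negative means refund" are assumptions for illustration; your system may record refunds differently.

```python
import pandas as pd

# Hypothetical orders where the second row reverses the first.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "order_value": [79.99, -79.99, 149.50],
})

# Assumption: in this example a negative value always means a refund.
orders["is_refund"] = orders["order_value"] < 0

sales_only = orders[~orders["is_refund"]]

print(orders)
print("Gross sales:", sales_only["order_value"].sum())
print("Net sales:", orders["order_value"].sum())
```

Keeping the flag lets you report gross and net figures from the same table instead of silently losing the refund rows.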

Create useful analysis fields.

df["order_month"] = df["order_date"].dt.to_period("M").astype(str)
df["revenue_per_item"] = df["order_value"] / df["items"]

New columns like these make later analysis easier. They also make charts easier to read.

Validate after transformations.
```
print(df.shape)
print(df[["order_value", "items", "revenue_per_item"]].describe())
print(df.isna().sum().sort_values(ascending=False).head(10))
```
A few simple checks can prevent quiet mistakes. For example, if revenue_per_item contains infinity, you probably still have rows where items is zero.
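
The infinity case is easy to miss because describe() does not call it out. A short sketch of an explicit check, on hypothetical rows where items is zero:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second and third have zero items.
orders = pd.DataFrame({
    "order_value": [50.0, 30.0, 0.0],
    "items": [2, 0, 0],
})
orders["revenue_per_item"] = orders["order_value"] / orders["items"]

# A nonzero value divided by zero gives inf; zero divided by zero gives NaN.
n_infinite = int(np.isinf(orders["revenue_per_item"]).sum())
n_missing = int(orders["revenue_per_item"].isna().sum())

print("Infinite values:", n_infinite)
print("Missing values:", n_missing)
```

pandas does not raise an error on division by zero, so a check like this is the only warning you get.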

You can also use assertions for rules that must always be true:

assert (df["order_value"] >= 0).all()
assert (df["items"] > 0).all()
assert df["order_date"].notna().all()
assert df["returned"].isin([0, 1]).all()

Assertions are direct. If a rule breaks, Python stops and tells you. It is less polite than a chart, but often more useful.

Explore the data with pandas summaries and visualizations

Once the data is clean enough, start with summaries before moving into models. Summaries show the broad shape of the data. Charts help you see patterns, outliers, and relationships that are hard to spot in rows.

Example: summarize sales by category

category_summary = (
    df.groupby("product_category")
    .agg(
        orders=("order_id", "count"),
        total_sales=("order_value", "sum"),
        average_order_value=("order_value", "mean"),
        average_items=("items", "mean")
    )
    .sort_values("total_sales", ascending=False)
)

category_summary

This produces a compact table showing which categories bring in the most sales, how many orders they have, and whether the average order value differs by category.

If one category has high total sales but a low average order value, it may be driven by volume. If another has fewer orders but a high average value, it may deserve different attention.

Example: plot sales by category

import matplotlib.pyplot as plt
import seaborn as sns

top_categories = category_summary.head(10).reset_index()

plt.figure(figsize=(10, 5))
sns.barplot(
    data=top_categories,
    x="total_sales",
    y="product_category"
)
plt.title("Top Product Categories by Total Sales")
plt.xlabel("Total Sales")
plt.ylabel("Product Category")
plt.tight_layout()
plt.show()

A bar chart is useful when you want to compare groups. Keep it simple. The goal is to see the pattern, not to prove that your chart can wear formal clothes.

Example: look at sales over time

monthly_sales = (
    df.groupby("order_month")
    .agg(total_sales=("order_value", "sum"))
    .reset_index()
)

plt.figure(figsize=(12, 5))
sns.lineplot(
    data=monthly_sales,
    x="order_month",
    y="total_sales",
    marker="o"
)
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Total Sales")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

A line chart helps you spot changes over time. Look for steady growth, sudden drops, seasonal movement, or unusual spikes.
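
Month-over-month change puts a number on what the line chart shows. The sketch below uses invented monthly totals; the change_pct column name is illustrative.

```python
import pandas as pd

# Hypothetical monthly totals in the same shape as monthly_sales.
monthly = pd.DataFrame({
    "order_month": ["2024-01", "2024-02", "2024-03", "2024-04"],
    "total_sales": [200.0, 400.0, 300.0, 600.0],
})

# YYYY-MM strings sort chronologically, so a plain sort keeps months in order.
monthly = monthly.sort_values("order_month").reset_index(drop=True)

# Percentage change from the previous month; the first month has no prior value.
monthly["change_pct"] = monthly["total_sales"].pct_change() * 100

print(monthly)
```

With only a few months of data, large percentage swings are common, so treat them as prompts for investigation rather than as trends.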

Example: inspect the distribution of order values

plt.figure(figsize=(8, 5))
sns.histplot(df["order_value"], bins=30)
plt.title("Distribution of Order Values")
plt.xlabel("Order Value")
plt.ylabel("Number of Orders")
plt.tight_layout()
plt.show()

A histogram shows whether most orders are small, whether a few very large orders exist, and whether the distribution is skewed. This matters for both analysis and modeling.

Outliers are not always errors. A large order may be a real enterprise purchase. The important part is noticing it and deciding how to handle it.
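
One common way to notice large values systematically is the 1.5 × IQR rule, shown here as a sketch on made-up order values. It is a convention for flagging points to review, not a test of whether they are errors.

```python
import pandas as pd

# Hypothetical order values with one very large order.
values = pd.Series(
    [34.75, 45.00, 58.25, 79.99, 149.50, 220.00, 399.00, 5200.00],
    name="order_value",
)

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Anything above this fence is flagged for review, not removed.
upper_fence = q3 + 1.5 * iqr
is_outlier = values > upper_fence

print("Upper fence:", round(upper_fence, 2))
print(values[is_outlier])
```

Flagging rather than dropping keeps the decision explicit: a real enterprise order stays in the data, and a data-entry error can be corrected.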

Build a first machine learning model with scikit-learn

A first model should be simple. The point is to build a clean training flow, not to chase a complex algorithm. Start with a baseline, then try a straightforward model and evaluate it on data the model did not see during training.

In this example, assume the dataset includes a normalized returned column that marks whether an order was returned: 1 means returned and 0 means not returned. The goal is to predict that value from order details.

Choose the target and features.
```
target = "returned"

features = [
    "region",
    "channel",
    "product_category",
    "order_value",
    "items",
    "revenue_per_item"
]

model_df = df[features + [target]].dropna()

assert model_df[target].isin([0, 1]).all()
```
Avoid features that would not be known at prediction time. For example, do not use a refund date to predict whether an order will be returned. That would be data leakage: the model would be learning from the answer key.

Split the data into training and test sets.
```
from sklearn.model_selection import train_test_split

X = model_df[features]
y = model_df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
```
The training set is for fitting the model. The test set is for evaluation. Keep them separate.

The stratify=y option helps preserve the target balance in both sets. This is useful when one class is much more common than the other.
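
To see what stratification does, compare the class balance in each split. This sketch uses a synthetic target with a 10% return rate; the data is invented for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target: 10 returned orders (1) out of 100.
y = pd.Series([1] * 10 + [0] * 90, name="returned")
X = pd.DataFrame({"order_value": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both splits keep the same share of returned orders.
print("Train return rate:", y_train.mean())
print("Test return rate:", y_test.mean())
```

Without stratify, a small test set could end up with very few returned orders, or none at all, which would make recall for that class meaningless.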

Build a baseline model.

A baseline gives you something simple to beat. For classification, a dummy model can predict the most frequent class.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_predictions = baseline.predict(X_test)

print("Baseline accuracy:", accuracy_score(y_test, baseline_predictions))
print(classification_report(y_test, baseline_predictions))

If your real model only barely beats the baseline, it may not be useful yet.

Create preprocessing for numeric and categorical columns.

Most scikit-learn estimators, including logistic regression, expect numeric input. Categorical columns such as region and channel must be encoded first.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

numeric_features = ["order_value", "items", "revenue_per_item"]
categorical_features = ["region", "channel", "product_category"]

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

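A pipeline keeps preprocessing and modeling together. That reduces mistakes and makes the workflow easier to rerun. It also means that imputation values and scaling statistics are learned only from the data passed to fit, so nothing from the test set leaks into preprocessing.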

Fit a simple model.

Logistic regression is a reasonable first classification model. It is not always the best model, but it is often a good starting point.

from sklearn.linear_model import LogisticRegression

clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LogisticRegression(max_iter=1000))
    ]
)

clf.fit(X_train, y_train)

Predict and evaluate on the test set.
```
predictions = clf.predict(X_test)

print("Model accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
```
Accuracy is easy to understand, but it can be misleading when classes are imbalanced. If only a small share of orders are returned, a model can look accurate by predicting “not returned” most of the time.
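
A tiny made-up example shows how this happens. Here a "model" that never predicts a return still scores well on accuracy; the labels are invented to illustrate the point.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test set: 5 returned orders out of 100.
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts "not returned".
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)

print("Accuracy:", accuracy)
print("Recall for returned orders:", recall)
```

Accuracy is high because most orders are not returned, while recall for the returned class is zero: the model catches none of the orders you care about.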

Read precision, recall, and F1-score too:

- Precision asks: when the model predicts a return, how often is it right?
- Recall asks: of the actual returns, how many did the model catch?
- F1-score balances precision and recall.

If returns are rare, the default decision threshold may not match your goal. As a next step, compare predicted probabilities at different thresholds, or try LogisticRegression(max_iter=1000, class_weight="balanced"). Tune these choices with a validation approach rather than repeatedly adjusting to the final test set.

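
Comparing thresholds could look something like the sketch below. It uses synthetic data from make_classification as a stand-in for the sales data, and the three-way split and threshold values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: roughly 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=6, weights=[0.9, 0.1], random_state=42
)

# Train / validation / test: thresholds are chosen on validation only.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each validation row.
val_proba = model.predict_proba(X_val)[:, 1]

results = {}
for threshold in (0.5, 0.3, 0.1):
    val_pred = (val_proba >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_val, val_pred, zero_division=0),
        "recall": recall_score(y_val, val_pred),
    }
    print(threshold, results[threshold])
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of precision. The test set stays untouched until a threshold has been chosen.
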
Inspect model coefficients carefully.

With logistic regression, coefficients can show which encoded features are associated with higher or lower predicted return risk. They do not prove cause and effect, and they can be affected by preprocessing, correlated features, and small sample sizes.
```
feature_names = clf.named_steps["preprocessor"].get_feature_names_out()

coefficients = (
    pd.Series(
        clf.named_steps["model"].coef_[0],
        index=feature_names
    )
    .sort_values(key=lambda values: values.abs(), ascending=False)
)

coefficients.head(10)
```
Positive coefficients push the model toward predicting returned = 1; negative coefficients push it toward returned = 0. Use this as a diagnostic clue, not as a final explanation.

Check for common beginner mistakes.
- Do not evaluate only on the training set. That tells you how well the model memorized familiar data.
- Do not include columns that reveal the answer after the fact.
- Do not use complex models before building a baseline.
- Do not treat one metric as the whole story.
- Do not ignore whether the prediction would be available early enough to be useful.

A clean modeling workflow is more valuable than a fancy model with unclear inputs.

Turn model results and charts into a clear insight

The final step is to connect the Python output back to a practical question. A chart, a grouped table, and a model metric are not insights by themselves. They become useful when they explain what is happening and suggest what to do next.

For example, suppose your exploration shows that one product category has a higher return rate than others, and the coefficient check shows that product_category, order_value, or channel is strongly associated with the model’s predictions. You might write a short conclusion like this:

Return risk appears higher for certain product categories, especially when order values are larger and orders come through specific channels. The first model performs better than the baseline, but recall is still limited, so it should not be used as an automated decision tool yet. A practical next step is to review product descriptions, sizing or fit information, and fulfillment notes for the highest-return categories, then retrain the model after those fixes are tested.

That conclusion is more useful than saying “the model accuracy is 0.78” and walking away. Metrics need context.

You can also create a small table to support the finding:

return_summary = (
    df.groupby("product_category")
    .agg(
        orders=("order_id", "count"),
        return_rate=("returned", "mean"),
        average_order_value=("order_value", "mean")
    )
    .sort_values("return_rate", ascending=False)
)

return_summary.head(10)

Because returned was normalized to 1 and 0 during cleanup, the mean of that column is the return rate for each group.
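
A quick check on hypothetical rows confirms that the mean of a 0/1 column equals returns divided by orders:

```python
import pandas as pd

# Hypothetical orders: 2 of 4 Apparel orders returned, 0 of 2 Home orders.
orders = pd.DataFrame({
    "product_category": ["Apparel", "Apparel", "Apparel", "Apparel", "Home", "Home"],
    "returned": [1, 0, 0, 1, 0, 0],
})

rates = orders.groupby("product_category")["returned"].agg(
    orders="count",
    returns="sum",
    return_rate="mean",
)

print(rates)
```

Keeping the order count next to the rate matters: a high return rate based on a handful of orders is much weaker evidence than the same rate across hundreds.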

Then pair it with a chart:

top_return_categories = return_summary.head(10).reset_index()

plt.figure(figsize=(10, 5))
sns.barplot(
    data=top_return_categories,
    x="return_rate",
    y="product_category"
)
plt.title("Highest Return Rates by Product Category")
plt.xlabel("Return Rate")
plt.ylabel("Product Category")
plt.tight_layout()
plt.show()

A good Python data science workflow moves in a steady order: set up the tools, inspect the data, clean it, explore it, model it, and explain what the result means. Each step protects the next one. If the cleaning is careless, the charts mislead you. If the evaluation is weak, the model looks better than it is. If the conclusion is vague, the work stays trapped in the notebook.

Keep the first version simple. A clear pandas summary, one useful chart, and a baseline scikit-learn model can teach you more than a complicated workflow that nobody can explain.
