How we at Infivit transformed chaotic ML experiments into a production-ready system using modern MLOps practices

The Hidden Crisis in Machine Learning Teams
Machine learning was supposed to make our lives easier. The promise was simple: build smart models, deploy them, and watch them solve complex problems automatically. But if you’ve worked on a real ML project, you know the reality is far messier. Most ML teams today face a critical problem that has nothing to do with algorithms or model architecture. It’s not about whether to use XGBoost or neural networks. The real problem is memory loss: their ML workflow has no ability to remember what worked, what failed, and why.
According to Google’s research on ML technical debt, only 5-10% of a real-world ML system consists of ML code. The remaining 90-95% is infrastructure, data pipelines, and operational systems.
Overview
MLOps Practices and the Sentiment Analysis Project
Six months ago, our team started what seemed like a straightforward project: sentiment analysis on YouTube comments. The goal was to classify comments as positive, negative, or neutral—classic NLP work that any competent data scientist could handle.
We followed the standard playbook:
- Collect YouTube comments via API
- Clean and preprocess text in Jupyter notebooks
- Train various models (Logistic Regression, SVM, BERT)
- Save the best model as a .pkl file
- Deploy to production
For the first few weeks, everything seemed fine. Our model achieved 78% accuracy on the test set. We showed demos to clients. Everyone was happy. Then the questions started.
“Why did accuracy drop from 78% to 65% in production?”
We didn’t know. Was it because the production data was different from our test set? Had we accidentally changed the preprocessing code? Were we even running the same model we tested?
“Which dataset did you use for the model in last Tuesday’s demo?”
Our datasets were saved as comments_cleaned_v2.csv, comments_cleaned_final.csv, and comments_ACTUALLY_final.csv. Nobody could remember which was which.
“Can you reproduce the results from last month’s report?”
Not with any confidence. We had the code in Git, but the data had changed, the preprocessing had evolved, and we couldn’t guarantee we were using the same hyperparameters.
The Root Cause: No System, Just Hope
The problem wasn’t our team’s competence. Everyone knew their stuff—data cleaning, feature engineering, model training, evaluation. The problem was that we had no system for tracking our work.
Our models were saved with names like:
- sentiment_model_final.pkl
- sentiment_model_final_v2.pkl
- sentiment_model_REAL_final.pkl
- sentiment_model_use_this_one.pkl
When results changed, we had no way to know if it was because of:
- New data added to the training set
- Different text preprocessing steps
- Changed hyperparameters
- A bug in the code
This is exactly why MLOps exists. Not to add complexity, but to add memory and structure to ML workflows.
Understanding MLOps workflow:
MLOps stands for Machine Learning Operations. At its core, it’s about bringing software engineering discipline to machine learning projects. In traditional software development, code is relatively static. Once you deploy a web application, the code doesn’t change unless you push an update. But in ML, data is constantly changing, which means models must continuously learn and adapt. This fundamental difference makes MLOps a distinct discipline.
The Five Pillars of MLOps
After months of research and implementation, we identified five essential pillars that form a complete MLOps system:
1. Version Control for Everything
Not just code, but:
- Dataset versions (which data was used)
- Model versions (which architecture and weights)
- Configuration versions (hyperparameters, preprocessing settings)
- Environment versions (library versions, dependencies)
2. Automated Pipelines
Manual execution is error-prone and slow. Automated pipelines ensure:
- Data loading and preprocessing run consistently
- Model training uses the correct parameters
- Evaluation metrics are calculated automatically
- Deployment happens smoothly without manual intervention
3. Experiment Tracking
Every training run is logged with complete metadata:
- Performance metrics (accuracy, precision, recall, F1)
- Hyperparameters used
- Training time and computational resources
- Dataset identifier
4. Deployment Infrastructure
Models need to run somewhere accessible:
- Cloud deployment (AWS, GCP, Azure)
- API endpoints for real-time inference
- Batch processing capabilities
- Containerization (Docker/Kubernetes)
5. Continuous Monitoring
Production models need constant attention:
- Performance metrics tracking
- Data drift detection
- Model drift detection
- Automated alerting when issues arise
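To make drift detection less abstract, here is a toy sketch of the underlying idea: compare the distribution of some feature (say, comment length) between production traffic and the training data. The `ks_statistic` helper below is hand-rolled purely for illustration, and the threshold and numbers are made up; in practice you would more likely reach for `scipy.stats.ks_2samp` or a dedicated monitoring tool.

```python
# Sketch of the idea behind data drift detection: compare the distribution of a
# feature (e.g. comment length) in production against the training set.
# ks_statistic() is a minimal hand-rolled helper, not a library API.

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

training_lengths = [12, 15, 14, 13, 16, 15, 14, 12, 13, 15]    # toy numbers
production_lengths = [30, 28, 35, 32, 29, 31, 34, 30, 33, 28]  # toy numbers

drift = ks_statistic(training_lengths, production_lengths)
if drift > 0.5:  # the threshold is a judgment call, tuned per feature
    print(f"Possible data drift detected (KS = {drift:.2f})")
```

A check like this can run on a schedule against recent production inputs and feed the automated alerting mentioned above.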
MLOps is not about having the fanciest tools. It’s about having a system that prevents you from losing track of your work.
When MLflow Wasn’t Enough
After learning about MLOps principles, we implemented MLflow for experiment tracking. This solved part of our problem—we could now see what hyperparameters we used and what metrics we achieved for each training run.
But we still had a massive gap: data versioning.
MLflow could tell us we achieved 82% accuracy with C=1.0 and max_iter=1000, but it couldn’t tell us which version of the cleaned dataset we used. And that turned out to be critical.
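One lightweight bridge we could have used before adopting full data versioning is to fingerprint the dataset and log the hash alongside every run. The `hash_file` helper below is a sketch we are introducing here, not an MLflow or DVC API:

```python
# Sketch: fingerprint the dataset so every run records exactly which data it saw.
# hash_file() is a hypothetical helper, not part of MLflow or DVC.
import hashlib

def hash_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# In a training script you could then log it next to the hyperparameters, e.g.:
# mlflow.log_param("data_md5", hash_file("data/processed/train.csv"))
```

This tells you *that* the data changed between two runs, but not how, and it cannot bring the old version back, which is exactly the gap DVC fills.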
The Incident That Changed Everything
One Monday morning, a team member accidentally overwrote processed_data.csv with a new version that used different preprocessing parameters. The file size changed. The number of rows changed. But nobody noticed immediately.
By Wednesday, when we tried to reproduce results from the previous week’s demo, we couldn’t. The data was different, and we had no way to get the old version back.
We had the preprocessing code in Git, but:
- The preprocessing parameters had changed
- The raw data had been updated with new comments
- We couldn’t remember the exact state of everything from last week
That’s when we finally admitted we needed proper data versioning.
What DVC Actually Does (And Why It Matters)
DVC (Data Version Control) is like Git for data and ML pipelines. DVC tracks changes to datasets without storing the actual data in Git. Instead, it stores small metadata files (.dvc files) in Git and keeps the actual data in remote storage (S3, GCS, Azure Blob, or even a local directory).
DVC lets you define your ML pipeline as a series of stages, where each stage explicitly declares its dependencies (input files, parameters) and outputs. When you change a parameter, DVC automatically figures out which stages need to rerun. With DVC, reproducing results from any point in history becomes trivial: check out the Git commit, run dvc pull to fetch the exact data versions, and run dvc repro. Done.

1. Data Versioning with DVC
Start by versioning your dataset with DVC to ensure reproducibility.
# Initialize DVC in your project
dvc init
# Add raw data to DVC tracking
dvc add data/raw/youtube_comments.csv
# Commit changes to Git
git add data/raw/youtube_comments.csv.dvc .gitignore
git commit -m "Add raw YouTube comments dataset"
# Push data to remote storage (e.g., AWS S3)
dvc remote add -d myremote s3://mybucket/dvcstore
dvc push
2. Training and Experiment Tracking with MLflow
Use MLflow to track your experiments, including parameters, metrics, and models.
# train.py
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
# Load dataset (versioned by DVC)
data = pd.read_csv("data/processed/train.csv")
X = data.drop("sentiment", axis=1)
y = data["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Set MLflow experiment
mlflow.set_experiment("sentiment-analysis")
with mlflow.start_run():
    # Define hyperparameters
    C = 1.0
    max_iter = 1000
    # Log parameters
    mlflow.log_param("C", C)
    mlflow.log_param("max_iter", max_iter)
    # Train model
    model = LogisticRegression(C=C, max_iter=max_iter)
    model.fit(X_train, y_train)
    # Predict and evaluate
    preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    # Log model
    mlflow.sklearn.log_model(model, "model")
    print(f"Logged model with accuracy: {accuracy}")
Run this script after preparing your data, and MLflow will track your runs.
3. Automating Pipelines with DVC
Define your pipeline stages in dvc.yaml to automate preprocessing, training, and evaluation.
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/youtube_comments.csv
      - src/preprocess.py
    outs:
      - data/processed/train.csv
  train:
    cmd: python train.py
    deps:
      - data/processed/train.csv
      - train.py
    outs:
      - model.pkl
Run the pipeline with:
dvc repro
DVC will rerun only the necessary stages when inputs or parameters change.
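To make the "parameters" half of that claim concrete: DVC stages can declare parameter dependencies through a params: section that points at keys in a params.yaml file, so editing a hyperparameter invalidates only the stages that use it. The stage below extends the train stage with such a section; the file names and values are illustrative, not our actual settings.

```yaml
# Extended train stage for dvc.yaml. The params entries refer to keys in a
# params.yaml file, e.g.:
#   train:
#     C: 1.0
#     max_iter: 1000
stages:
  train:
    cmd: python train.py
    deps:
      - data/processed/train.csv
      - train.py
    params:
      - train.C
      - train.max_iter
    outs:
      - model.pkl
```

With this in place, changing `train.C` in params.yaml and running dvc repro reruns training but leaves preprocessing untouched.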
4. Model Deployment as an API
Deploy the trained model as a REST API (e.g., using FastAPI or Azure Functions) so the Chrome extension can call it.
# api.py (using FastAPI)
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.sklearn
app = FastAPI()

# Load model from MLflow
model = mlflow.sklearn.load_model("runs:/<run_id>/model")

class TextRequest(BaseModel):
    text: str

@app.post("/analyze")
def analyze_sentiment(request: TextRequest):
    # Preprocess input text (simplified)
    features = preprocess_text(request.text)  # Your preprocessing function
    prediction = model.predict([features])
    sentiment = prediction[0]
    return {"sentiment": sentiment}
Deploy this API to a cloud service and get the endpoint URL.
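Since containerization was one of our deployment pillars, a common path is to package the API as a Docker image first and hand that to whichever cloud service you use. The sketch below assumes api.py sits at the repository root next to a requirements.txt listing fastapi, uvicorn, mlflow, and scikit-learn; adjust paths to your layout.

```dockerfile
# Sketch of a container image for the FastAPI service (assumed file layout).
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY api.py .
EXPOSE 8000

# Serve the FastAPI app defined in api.py on port 8000
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
```

Building and running it locally (`docker build -t sentiment-api . && docker run -p 8000:8000 sentiment-api`) gives you the same artifact you will ship to production.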
5. Chrome Extension to Use the API
Manifest (manifest.json):
{
  "manifest_version": 3,
  "name": "Sentiment Analyzer",
  "version": "1.0",
  "permissions": ["activeTab", "scripting"],
  "host_permissions": ["https://your-api-url/*"],
  "action": {
    "default_popup": "popup.html"
  },
  "background": {
    "service_worker": "background.js"
  }
}
Popup HTML (popup.html):
<!DOCTYPE html>
<html>
<head><title>Sentiment Analyzer</title></head>
<body>
  <button id="analyzeBtn">Analyze Selected Text</button>
  <div id="result"></div>
  <script src="popup.js"></script>
</body>
</html>
Popup JS (popup.js):
document.getElementById("analyzeBtn").addEventListener("click", () => {
  chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
    chrome.scripting.executeScript({
      target: { tabId: tabs[0].id },
      func: () => window.getSelection().toString()
    }, (selection) => {
      const text = selection[0].result;
      if (!text) {
        alert("Please select some text on the page.");
        return;
      }
      fetch("https://your-api-url/analyze", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text: text })
      })
        .then(res => res.json())
        .then(data => {
          document.getElementById("result").textContent = `Sentiment: ${data.sentiment}`;
        })
        .catch(err => {
          document.getElementById("result").textContent = "Error analyzing sentiment.";
        });
    });
  });
});
6. Monitoring and Retraining
Set up monitoring on your API to track usage and model performance. If accuracy drops or data drifts, trigger retraining by updating your data, running the pipeline, and redeploying the model.
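As a minimal sketch of what "trigger retraining when accuracy drops" can mean in code: the helper below is hypothetical (not an MLflow or DVC API), the thresholds are illustrative, and it assumes you can collect ground-truth labels for a slice of production predictions.

```python
# Sketch of a retraining trigger: compare recent production accuracy against the
# baseline recorded at deployment time. should_retrain() is a hypothetical
# helper with illustrative thresholds.

def should_retrain(recent_outcomes, baseline_accuracy, tolerance=0.05, min_samples=100):
    """Return True when accuracy on recent labeled samples falls more than
    `tolerance` below the baseline, given enough samples to be meaningful."""
    if len(recent_outcomes) < min_samples:
        return False  # not enough evidence yet
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance

# Example: model shipped at 78% accuracy; of the last 200 labeled predictions,
# only 130 matched their labels (65%).
outcomes = [1] * 130 + [0] * 70  # 1 = prediction matched the label
print(should_retrain(outcomes, baseline_accuracy=0.78))
```

When this fires, the retraining loop is just the pipeline from earlier: update the data, dvc repro, and redeploy the logged model.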

Immediate Benefits We Noticed
1. Experimentation Speed Increased 5x
Before MLOps, running a new experiment meant manually executing 5-7 scripts in the right order, each time risking a mistake. With DVC, we just change parameters and run one command. Team members went from running 2-3 experiments per day to 10-15.
2. Onboarding Time Dropped from 3 Days to 2 Hours
New team members used to spend days figuring out which notebooks to run and which datasets to use. Now they clone the repo, run dvc pull, and have everything they need. The pipeline is self-documenting.
3. Zero Reproduction Failures
Before, when a client asked us to reproduce results from a demo or report, we succeeded maybe 60% of the time. With DVC, we have a 100% success rate. Check out the commit, pull the data, run the pipeline. Done.
4. Production Confidence Skyrocketed
We know exactly which model is in production, what data it was trained on, and what metrics it achieved. When performance drops, we can compare current performance against historical baselines and identify the exact change that caused the issue.
Lessons Learned
We didn’t implement every MLOps practice on day one. We started with experiment tracking (MLflow), then added data versioning (DVC), then automation. Build incrementally based on your biggest pain points. Most of our failures came from data problems, not code bugs. Versioning data is just as important as versioning code—maybe more so. Every manual step is an opportunity for mistakes. Automate everything that you run more than once. Our dvc.yaml and params.yaml files serve as living documentation. Team members can understand the entire pipeline just by reading these two files.
Final Thoughts:
MLOps sounds intimidating. The term conjures images of massive infrastructure, dedicated DevOps teams, and enterprise-grade tools costing thousands of dollars per month. But that’s not what MLOps is really about.
MLOps is about having a system. A system for tracking your work. A system for versioning your data. A system for automating repetitive tasks. A system for deploying models reliably.
You can start with free, open-source tools:
- DVC for data versioning and pipelines
- MLflow for experiment tracking
- Git/GitHub for code versioning
- Docker for containerization
Don’t try to build the perfect MLOps system from day one. Build it incrementally, one pain point at a time.
Machine learning should make work easier, not harder. MLOps is what makes that promise real.
Resources:
- MLflow: https://mlflow.org/docs/latest/ml/
- DVC: https://doc.dvc.org/