How we at Infivit transformed chaotic ML experiments into a production-ready system using modern MLOps practices

The Hidden Crisis in Machine Learning Teams
Machine learning was supposed to make our lives easier. The promise was simple: build smart models, deploy them, and watch them solve complex problems automatically. But if you’ve worked on a real ML project, you know the reality is far messier. Most ML teams today face a critical problem that has nothing to do with algorithms or model architecture. It’s not about whether to use XGBoost or neural networks. The real problem is memory loss: their ML workflow has no ability to remember what worked, what failed, and why.
According to Google’s research on ML technical debt, only 5-10% of a real-world ML system consists of ML code. The remaining 90-95% is infrastructure, data pipelines, and operational systems.
Overview
MLOps Practices and the Sentiment Analysis Project
Six months ago, our team started what seemed like a straightforward project: sentiment analysis on YouTube comments. The goal was to classify comments as positive, negative, or neutral—classic NLP work that any competent data scientist could handle.
We followed the standard playbook:
- Collect YouTube comments via API
- Clean and preprocess text in Jupyter notebooks
- Train various models (Logistic Regression, SVM, BERT)
- Save the best model as a .pkl file
- Deploy to production
For the first few weeks, everything seemed fine. Our model achieved 78% accuracy on the test set. We showed demos to clients. Everyone was happy. Then the questions started.
“Why did accuracy drop from 78% to 65% in production?”
We didn’t know. Was it because the production data was different from our test set? Had we accidentally changed the preprocessing code? Were we even running the same model we tested?
“Which dataset did you use for the model in last Tuesday’s demo?”
Our datasets were saved as comments_cleaned_v2.csv, comments_cleaned_final.csv, and comments_ACTUALLY_final.csv. Nobody could remember which was which.
“Can you reproduce the results from last month’s report?”
Not with any confidence. We had the code in Git, but the data had changed, the preprocessing had evolved, and we couldn’t guarantee we were using the same hyperparameters.
The Root Cause: No System, Just Hope
The problem wasn’t our team’s competence. Everyone knew their stuff—data cleaning, feature engineering, model training, evaluation. The problem was that we had no system for tracking our work.
Our models were saved with names like:
- sentiment_model_final.pkl
- sentiment_model_final_v2.pkl
- sentiment_model_REAL_final.pkl
- sentiment_model_use_this_one.pkl
When results changed, we had no way to know if it was because of:
- New data added to the training set
- Different text preprocessing steps
- Changed hyperparameters
- A bug in the code
This is exactly why MLOps exists. Not to add complexity, but to add memory and structure to ML workflows.
Understanding MLOps workflow:
MLOps stands for Machine Learning Operations. At its core, it’s about bringing software engineering discipline to machine learning projects. In traditional software development, code is relatively static. Once you deploy a web application, the code doesn’t change unless you push an update. But in ML, data is constantly changing, which means models must continuously learn and adapt. This fundamental difference makes MLOps a distinct discipline.
The Five Pillars of MLOps
After months of research and implementation, we identified five essential pillars that form a complete MLOps system:
1. Version Control for Everything
Not just code, but:
- Dataset versions (which data was used)
- Model versions (which architecture and weights)
- Configuration versions (hyperparameters, preprocessing settings)
- Environment versions (library versions, dependencies)
2. Automated Pipelines
Manual execution is error-prone and slow. Automated pipelines ensure:
- Data loading and preprocessing run consistently
- Model training uses the correct parameters
- Evaluation metrics are calculated automatically
- Deployment happens smoothly without manual intervention
3. Experiment Tracking
Every training run is logged with complete metadata:
- Performance metrics (accuracy, precision, recall, F1)
- Hyperparameters used
- Training time and computational resources
- Dataset identifier
4. Deployment Infrastructure
Models need to run somewhere accessible:
- Cloud deployment (AWS, GCP, Azure)
- API endpoints for real-time inference
- Batch processing capabilities
- Containerization (Docker/Kubernetes)
5. Continuous Monitoring
Production models need constant attention:
- Performance metrics tracking
- Data drift detection
- Model drift detection
- Automated alerting when issues arise
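To make drift detection less abstract, here is a toy sketch of the underlying idea: compare the distribution of some feature (say, comment length) between production traffic and the training data. The `ks_statistic` helper below is hand-rolled purely for illustration, and the threshold and numbers are made up; in practice you would more likely reach for `scipy.stats.ks_2samp` or a dedicated monitoring tool.

```python
# Sketch of the idea behind data drift detection: compare the distribution of a
# feature (e.g. comment length) in production against the training set.
# ks_statistic() is a minimal hand-rolled helper, not a library API.

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

training_lengths = [12, 15, 14, 13, 16, 15, 14, 12, 13, 15]    # toy numbers
production_lengths = [30, 28, 35, 32, 29, 31, 34, 30, 33, 28]  # toy numbers

drift = ks_statistic(training_lengths, production_lengths)
if drift > 0.5:  # the threshold is a judgment call, tuned per feature
    print(f"Possible data drift detected (KS = {drift:.2f})")
```

A check like this can run on a schedule against recent production inputs and feed the automated alerting mentioned above.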
MLOps is not about having the fanciest tools. It’s about having a system that prevents you from losing track of your work.
When MLflow Wasn’t Enough
After learning about MLOps principles, we implemented MLflow for experiment tracking. This solved part of our problem—we could now see what hyperparameters we used and what metrics we achieved for each training run.
But we still had a massive gap: data versioning.
MLflow could tell us we achieved 82% accuracy with C=1.0 and max_iter=1000, but it couldn’t tell us which version of the cleaned dataset we used. And that turned out to be critical.
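One lightweight bridge we could have used before adopting full data versioning is to fingerprint the dataset and log the hash alongside every run. The `hash_file` helper below is a sketch we are introducing here, not an MLflow or DVC API:

```python
# Sketch: fingerprint the dataset so every run records exactly which data it saw.
# hash_file() is a hypothetical helper, not part of MLflow or DVC.
import hashlib

def hash_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# In a training script you could then log it next to the hyperparameters, e.g.:
# mlflow.log_param("data_md5", hash_file("data/processed/train.csv"))
```

This tells you *that* the data changed between two runs, but not how, and it cannot bring the old version back, which is exactly the gap DVC fills.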
The Incident That Changed Everything
One Monday morning, a team member accidentally overwrote processed_data.csv with a new version that used different preprocessing parameters. The file size changed. The number of rows changed. But nobody noticed immediately.
By Wednesday, when we tried to reproduce results from the previous week’s demo, we couldn’t. The data was different, and we had no way to get the old version back.
We had the preprocessing code in Git, but:
- The preprocessing parameters had changed
- The raw data had been updated with new comments
- We couldn’t remember the exact state of everything from last week
That’s when we finally admitted we needed proper data versioning.
What DVC Actually Does (And Why It Matters)
DVC (Data Version Control) is like Git for data and ML pipelines. DVC tracks changes to datasets without storing the actual data in Git. Instead, it stores small metadata files (.dvc files) in Git and keeps the actual data in remote storage (S3, GCS, Azure Blob, or even a local directory).
DVC lets you define your ML pipeline as a series of stages, where each stage explicitly declares its dependencies (input files, parameters) and outputs. When you change a parameter, DVC automatically figures out which stages need to rerun. With DVC, reproducing results from any point in history becomes trivial: check out the Git commit, run dvc pull to fetch the exact data versions, and run dvc repro. Done.

1. Data Versioning with DVC
Start by versioning your dataset with DVC to ensure reproducibility.
# Initialize DVC in your project
dvc init
# Add raw data to DVC tracking
dvc add data/raw/youtube_comments.csv
# Commit changes to Git
git add data/raw/youtube_comments.csv.dvc .gitignore
git commit -m "Add raw YouTube comments dataset"
# Push data to remote storage (e.g., AWS S3)
dvc remote add -d myremote s3://mybucket/dvcstore
dvc push
2. Training and Experiment Tracking with MLflow
Use MLflow to track your experiments, including parameters, metrics, and models.
# train.py
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
# Load dataset (versioned by DVC)
data = pd.read_csv("data/processed/train.csv")
X = data.drop("sentiment", axis=1)
y = data["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Set MLflow experiment
mlflow.set_experiment("sentiment-analysis")
with mlflow.start_run():
    # Define hyperparameters
    C = 1.0
    max_iter = 1000
    # Log parameters
    mlflow.log_param("C", C)
    mlflow.log_param("max_iter", max_iter)
    # Train model
    model = LogisticRegression(C=C, max_iter=max_iter)
    model.fit(X_train, y_train)
    # Predict and evaluate
    preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    # Log model
    mlflow.sklearn.log_model(model, "model")
    print(f"Logged model with accuracy: {accuracy}")
Run this script after preparing your data, and MLflow will track your runs.
3. Automating Pipelines with DVC
Define your pipeline stages in dvc.yaml to automate preprocessing, training, and evaluation.
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/youtube_comments.csv
      - src/preprocess.py
    outs:
      - data/processed/train.csv
  train:
    cmd: python train.py
    deps:
      - data/processed/train.csv
      - train.py
    outs:
      - model.pkl
Run the pipeline with:
dvc repro
DVC will rerun only the necessary stages when inputs or parameters change.
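To make the "parameters" half of that claim concrete: DVC stages can declare parameter dependencies through a params: section that points at keys in a params.yaml file, so editing a hyperparameter invalidates only the stages that use it. The stage below extends the train stage with such a section; the file names and values are illustrative, not our actual settings.

```yaml
# Extended train stage for dvc.yaml. The params entries refer to keys in a
# params.yaml file, e.g.:
#   train:
#     C: 1.0
#     max_iter: 1000
stages:
  train:
    cmd: python train.py
    deps:
      - data/processed/train.csv
      - train.py
    params:
      - train.C
      - train.max_iter
    outs:
      - model.pkl
```

With this in place, changing `train.C` in params.yaml and running dvc repro reruns training but leaves preprocessing untouched.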
4. Model Deployment as an API
Deploy the trained model as a REST API (e.g., using FastAPI or Azure Functions) so the Chrome extension can call it.
# api.py (using FastAPI)
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.sklearn
app = FastAPI()

# Load model from MLflow
model = mlflow.sklearn.load_model("runs:/<run_id>/model")

class TextRequest(BaseModel):
    text: str

@app.post("/analyze")
def analyze_sentiment(request: TextRequest):
    # Preprocess input text (simplified)
    features = preprocess_text(request.text)  # Your preprocessing function
    prediction = model.predict([features])
    sentiment = prediction[0]
    return {"sentiment": sentiment}
Deploy this API to a cloud service and get the endpoint URL.
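Since containerization was one of our deployment pillars, a common path is to package the API as a Docker image first and hand that to whichever cloud service you use. The sketch below assumes api.py sits at the repository root next to a requirements.txt listing fastapi, uvicorn, mlflow, and scikit-learn; adjust paths to your layout.

```dockerfile
# Sketch of a container image for the FastAPI service (assumed file layout).
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY api.py .
EXPOSE 8000

# Serve the FastAPI app defined in api.py on port 8000
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
```

Building and running it locally (`docker build -t sentiment-api . && docker run -p 8000:8000 sentiment-api`) gives you the same artifact you will ship to production.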
5. Chrome Extension to Use the API
Manifest (manifest.json):
{
  "manifest_version": 3,
  "name": "Sentiment Analyzer",
  "version": "1.0",
  "permissions": ["activeTab", "scripting"],
  "host_permissions": ["https://your-api-url/*"],
  "action": {
    "default_popup": "popup.html"
  },
  "background": {
    "service_worker": "background.js"
  }
}
Popup HTML (popup.html):
<!DOCTYPE html>
<html>
<head><title>Sentiment Analyzer</title></head>
<body>
  <button id="analyzeBtn">Analyze Selected Text</button>
  <div id="result"></div>
  <script src="popup.js"></script>
</body>
</html>
Popup JS (popup.js):
document.getElementById("analyzeBtn").addEventListener("click", () => {
  chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
    chrome.scripting.executeScript({
      target: { tabId: tabs[0].id },
      func: () => window.getSelection().toString()
    }, (selection) => {
      const text = selection[0].result;
      if (!text) {
        alert("Please select some text on the page.");
        return;
      }
      fetch("https://your-api-url/analyze", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text: text })
      })
        .then(res => res.json())
        .then(data => {
          document.getElementById("result").textContent = `Sentiment: ${data.sentiment}`;
        })
        .catch(err => {
          document.getElementById("result").textContent = "Error analyzing sentiment.";
        });
    });
  });
});
6. Monitoring and Retraining
Set up monitoring on your API to track usage and model performance. If accuracy drops or data drifts, trigger retraining by updating your data, running the pipeline, and redeploying the model.
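As a minimal sketch of what "trigger retraining when accuracy drops" can mean in code: the helper below is hypothetical (not an MLflow or DVC API), the thresholds are illustrative, and it assumes you can collect ground-truth labels for a slice of production predictions.

```python
# Sketch of a retraining trigger: compare recent production accuracy against the
# baseline recorded at deployment time. should_retrain() is a hypothetical
# helper with illustrative thresholds.

def should_retrain(recent_outcomes, baseline_accuracy, tolerance=0.05, min_samples=100):
    """Return True when accuracy on recent labeled samples falls more than
    `tolerance` below the baseline, given enough samples to be meaningful."""
    if len(recent_outcomes) < min_samples:
        return False  # not enough evidence yet
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance

# Example: model shipped at 78% accuracy; of the last 200 labeled predictions,
# only 130 matched their labels (65%).
outcomes = [1] * 130 + [0] * 70  # 1 = prediction matched the label
print(should_retrain(outcomes, baseline_accuracy=0.78))
```

When this fires, the retraining loop is just the pipeline from earlier: update the data, dvc repro, and redeploy the logged model.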

Immediate Benefits We Noticed
1. Experimentation Speed Increased 5x
Before MLOps, running a new experiment meant manually executing 5-7 scripts in the right order, each time risking a mistake. With DVC, we just change parameters and run one command. Team members went from running 2-3 experiments per day to 10-15.
2. Onboarding Time Dropped from 3 Days to 2 Hours
New team members used to spend days figuring out which notebooks to run and which datasets to use. Now they clone the repo, run dvc pull, and have everything they need. The pipeline is self-documenting.
3. Zero Reproduction Failures
Before, when a client asked us to reproduce results from a demo or report, we succeeded maybe 60% of the time. With DVC, we have a 100% success rate. Check out the commit, pull the data, run the pipeline. Done.
4. Production Confidence Skyrocketed
We know exactly which model is in production, what data it was trained on, and what metrics it achieved. When performance drops, we can compare current performance against historical baselines and identify the exact change that caused the issue.
Lessons Learned
We didn’t implement every MLOps practice on day one. We started with experiment tracking (MLflow), then added data versioning (DVC), then automation. Build incrementally based on your biggest pain points. Most of our failures came from data problems, not code bugs. Versioning data is just as important as versioning code—maybe more so. Every manual step is an opportunity for mistakes. Automate everything that you run more than once. Our dvc.yaml and params.yaml files serve as living documentation. Team members can understand the entire pipeline just by reading these two files.
Final Thoughts:
MLOps sounds intimidating. The term conjures images of massive infrastructure, dedicated DevOps teams, and enterprise-grade tools costing thousands of dollars per month. But that’s not what MLOps is really about.
MLOps is about having a system. A system for tracking your work. A system for versioning your data. A system for automating repetitive tasks. A system for deploying models reliably.
You can start with free, open-source tools:
- DVC for data versioning and pipelines
- MLflow for experiment tracking
- Git/GitHub for code versioning
- Docker for containerization
Don’t try to build the perfect MLOps system from day one. Build it incrementally, one pain point at a time.
Machine learning should make work easier, not harder. MLOps is what makes that promise real.
Resources:
- MLflow: https://mlflow.org/docs/latest/ml/
- DVC: https://doc.dvc.org/