BizLens: Startup Success Evaluator
Developed an end-to-end machine learning pipeline to rigorously test the viability of predicting startup success from a public dataset. This project involved advanced data cleansing, feature engineering, and the systematic evaluation of a Random Forest regression model using Python and Scikit-learn, all within AI4ALL’s cutting-edge AI4ALL Ignite accelerator. Our primary finding was a critical analysis of the dataset’s integrity, demonstrating a mature, scientific approach to data validation.
Problem Statement
Given the recent surge in new business ventures and the historically high failure rate of startups, entrepreneurs lack reliable, data-driven tools to assess their potential for success. This project was motivated by the need to move beyond anecdotal evidence and create a model that could identify the key drivers of success, providing actionable insights to founders in the critical early stages of their companies.
- Source Code for Backend/Frontend: Github Repo Link
- Presentation: BizLens_Startup_Success_Evaluator.pdf
- Portfolio: Link
Key Results
-
Initial Data Validation Failure
Our initial analysis on a global startup dataset revealed a critical flaw. After training multiple models, we consistently received negative R-squared values, leading us to conclude that the provided Success Score was statistically random and not correlated with the features. -
Project Pivot and Metric Engineering
We pivoted to a new, more reliable dataset, which presented a new challenge: it lacked a success metric. We engineered a new, defensible Success Score based on clear outcome variables (like IPO status, acquisition, and valuation), allowing us to build a meaningful model while avoiding data leakage. -
Successful Model Development
Using the new dataset and our engineered success metric, we successfully trained a Random Forest Regressor. The final model achieved a positive R-squared score of 0.6826, demonstrating a moderately strong predictive relationship between a startup’s characteristics and its potential for success. -
Output Transformation for User Experience
We observed that the model’s raw predictions were clustered in a narrow range (1.0 to 1.5). To provide a more intuitive output for the end-user, we implemented a final transformation to scale the raw score to a clear 0–1 range, where 0 represents lower potential and 1 represents higher potential.
Methodologies
- Extensive data preparation pipeline using Pandas:
- One-hot encoding categorical variables
- Engineering features like Startup Age and financial efficiency ratios
- Log-transformations to handle heavily skewed data
- Outlier detection using Interquartile Range (IQR) method
- Modeling with Scikit-learn:
- Trained a
RandomForestRegressor - Used 5-fold cross-validation for robust evaluation
- Trained a
Data Sources
- Kaggle Dataset: Startup Investments Crunchbase
Technologies Used
- Python
- Pandas
- NumPy
- Scikit-learn
- Matplotlib & Seaborn
- Plotly Express
Authors
This project was completed in collaboration with:
- Ebyan Jama: University of Minnesota
- Elisa Yu: Tufts University
- Sarah Toussaint: New York University
- Shirina Daniel: Florida International University
- Victor Olivo: Rutgers University