Course: MSBA Social Media Analytics, Spring 2026

Team: Alisha Surabhi, Shivangi Gupta, Simoni Dalal, Rohan Chimne, Radha Pawar, Justin Yang
The starting point for this project was a frustration with how "influencer" gets defined in practice. Most tools and most brand briefs use follower count as the primary metric. It's intuitive, it's easy to measure, and it's mostly wrong.
A user with 80,000 followers who gets 200 retweets per post and whose content doesn't travel outside their immediate network has less actual influence than a user with 4,000 followers who sits at the junction between five different topic communities. The second person is a bridge — when they share something, it reaches people who would never have encountered it otherwise. The first person is just popular within a bubble.
This project is about measuring that distinction at scale, using network structure rather than raw popularity metrics.
We took a Twitter interaction dataset and built a directed graph where users are nodes and retweets/mentions/replies are edges. Then for every user in the graph, we computed four centrality measures:
- Degree centrality — how many direct connections a user has. This is roughly what follower count captures.
- Betweenness centrality — how often a user appears on the shortest path between other pairs of users. High betweenness = bridge node.
- Closeness centrality — how quickly a user can reach the rest of the network in terms of graph distance.
- Eigenvector centrality — whether a user is connected to other well-connected users. Being known by influential people matters more than being known by many people.
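The four measures above can be computed with networkx. The sketch below is a minimal illustration and not the notebook's code: the edge list is made up, the function name `centrality_features` is hypothetical, and running eigenvector centrality on an undirected copy of the graph is an assumption made to keep the power iteration well behaved on sparse directed data.

```python
import networkx as nx
import pandas as pd

# Toy interaction list (made up for illustration). An edge (u, v) means
# user u retweeted, mentioned, or replied to user v.
interactions = [
    ("a", "b"), ("b", "a"), ("c", "a"), ("c", "b"),
    ("d", "c"), ("c", "d"),
    ("d", "e"), ("e", "f"), ("f", "d"), ("e", "d"),
]

G = nx.DiGraph()
G.add_edges_from(interactions)


def centrality_features(graph: nx.DiGraph) -> pd.DataFrame:
    """Return one row per user and one column per centrality measure."""
    # Assumption: eigenvector centrality is computed on an undirected copy.
    undirected = graph.to_undirected()
    return pd.DataFrame({
        "degree": nx.degree_centrality(graph),
        "betweenness": nx.betweenness_centrality(graph),
        "closeness": nx.closeness_centrality(graph),
        "eigenvector": nx.eigenvector_centrality(undirected, max_iter=1000),
    })


features = centrality_features(G)
print(features.sort_values("betweenness", ascending=False))
```

In this toy graph, users `c` and `d` are the only route between the `a`/`b` cluster and the `e`/`f` cluster, so they pick up the betweenness even though no user has many connections. Note that exact betweenness is expensive on large graphs; `nx.betweenness_centrality` accepts a `k` argument to approximate it from a sample of nodes.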
We then framed influencer identification as a binary classification problem and trained a logistic regression model using these four centrality measures as features.
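As a rough sketch of that setup, the snippet below fits a logistic regression on the four centrality features with scikit-learn. Everything about the data here is an assumption: the feature values are random, and the influencer labels come from a made-up rule that favors betweenness by construction, so it demonstrates the mechanics and does not reproduce our results. Standardizing the features first is also our assumption here; it is what makes coefficient magnitudes comparable across features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["degree", "betweenness", "closeness", "eigenvector"]

# Synthetic stand-in for the real centrality table (illustration only).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((500, 4)), columns=FEATURES)

# Hypothetical labeling rule: betweenness-heavy by design, plus noise.
signal = 2.0 * X["betweenness"] + 0.5 * X["degree"] + rng.normal(0, 0.3, 500)
y = (signal > 1.25).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features so the fitted coefficients share a common unit.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}

# Rank features by absolute coefficient size.
coefs = pd.Series(model.named_steps["logisticregression"].coef_[0], index=FEATURES)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(metrics)
print(coefs)
```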
Accuracy came out around 84%, which is reasonable. But the more interesting result was which features drove that accuracy.
Betweenness centrality was the strongest predictor by coefficient magnitude, well ahead of degree centrality (the closest proxy for follower count). The users the model was most confident about labeling as influencers were overwhelmingly the bridge nodes, people who connect otherwise-disconnected communities, rather than the nodes with the most direct connections.
The implication for brand strategy is fairly direct: if you're allocating influencer marketing budget based on follower count, you're probably missing the people who would actually spread your message furthest.
| Metric | Score |
|---|---|
| Accuracy | ~84% |
| Precision | ~81% |
| Recall | ~79% |
pip install pandas numpy scikit-learn matplotlib seaborn networkx
jupyter notebook Assignment_1_SMA.ipynb

tweets_sample.csv needs to be in the same directory as the notebook.

- Assignment_1_SMA.ipynb: the full analysis, end to end
- tweets_sample.csv: the tweet/interaction dataset
- MSBA SMA S2026 Assignment 1.docx: original assignment brief
The obvious next step is moving beyond binary classification to a ranked influence score that could directly power an outreach prioritization tool. You'd score every user in a brand's potential influencer pool, rank by weighted centrality combination (betweenness-heavy weighting based on this analysis), and surface the top candidates. The logistic regression gives you a probability that could serve as that score directly.
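One way the weighted-combination idea could look in code is sketched below. The weights, the function name `rank_candidates`, and the three example users are all hypothetical; the weights in particular are placeholders that would need tuning, not values from our analysis. Each centrality is min-max scaled so the weights apply to comparable ranges.

```python
import pandas as pd

# Hypothetical betweenness-heavy weights (placeholders, they sum to 1).
WEIGHTS = {"betweenness": 0.5, "eigenvector": 0.2, "closeness": 0.15, "degree": 0.15}


def rank_candidates(features: pd.DataFrame, weights: dict, top_k: int = 3) -> pd.DataFrame:
    """Score users by a weighted sum of min-max scaled centralities."""
    cols = list(weights)
    lo = features[cols].min()
    spread = (features[cols].max() - lo).replace(0, 1)  # avoid divide-by-zero
    scaled = (features[cols] - lo) / spread
    score = scaled.mul(pd.Series(weights), axis="columns").sum(axis=1)
    ranked = features.assign(influence_score=score)
    return ranked.sort_values("influence_score", ascending=False).head(top_k)


# Made-up candidate pool for illustration.
pool = pd.DataFrame(
    {
        "degree": [0.90, 0.20, 0.40],
        "betweenness": [0.05, 0.60, 0.20],
        "closeness": [0.50, 0.55, 0.45],
        "eigenvector": [0.30, 0.40, 0.35],
    },
    index=["popular_in_bubble", "bridge_user", "average_user"],
)

ranked = rank_candidates(pool, WEIGHTS)
print(ranked[["influence_score"]])
```

If the fitted classifier is used instead, the second column of scikit-learn's `predict_proba` output gives the per-user probability of the positive class and can be sorted the same way.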
There's also a temporal dimension we didn't explore — centrality isn't static. A user's betweenness centrality can shift significantly as conversations evolve around new topics. Building a time-series version of this analysis would be a more realistic representation of how influence actually works.
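A minimal sketch of what that could look like: split the interactions into time windows, build one graph per window, and track each user's betweenness across windows. The column names (`source`, `target`, `timestamp`), the weekly window, and the function name `betweenness_over_time` are assumptions for illustration; we have not checked them against the dataset's actual schema.

```python
import networkx as nx
import pandas as pd

# Made-up interactions spanning two calendar weeks.
interactions = pd.DataFrame({
    "source": ["a", "b", "a", "c"],
    "target": ["b", "c", "c", "b"],
    "timestamp": pd.to_datetime(
        ["2026-01-05", "2026-01-06", "2026-01-12", "2026-01-13"]
    ),
})


def betweenness_over_time(df: pd.DataFrame, freq: str = "W") -> pd.DataFrame:
    """Return a table of betweenness scores: one row per window, one column per user."""
    rows = {}
    windows = df["timestamp"].dt.to_period(freq)
    for period, chunk in df.groupby(windows):
        graph = nx.from_pandas_edgelist(
            chunk, "source", "target", create_using=nx.DiGraph
        )
        rows[str(period)] = nx.betweenness_centrality(graph)
    # Users absent from a window get a score of 0 for that window.
    return pd.DataFrame.from_dict(rows, orient="index").fillna(0.0)


timeline = betweenness_over_time(interactions)
print(timeline)
```

In the toy data, user `b` is the bridge in the first week and user `c` takes over in the second, which is the kind of shift a static snapshot would miss.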