Research

Working paper

Detecting Bots and Assessing Their Impact in Social Networks

Online social networks are often subject to manipulation by malicious actors through the use of automated accounts known as bots. We consider the problem of detecting bots in online social networks and assessing their impact on the opinions of individuals. We begin by analyzing the behavior of bots in social networks and identify that they exhibit heterophily. We use this to develop a detection algorithm based on the Ising model from statistical physics. We show that this Ising model algorithm can identify bots with higher accuracy than state of the art methods. We then develop a framework to estimate the impact bots have on the opinions of users in social networks. Our framework is based on opinion dynamics models and how the activity level and network connectivity of the bots shift equilibrium opinions. Analysis on several real social networks shows limited number of bots can cause non-trivial shifts in the population opinions.

Working paper

Optimizing Opinions with Stubborn Agents Under Time-Varying Dynamics

We consider optimizing the placement of stubborn agents in a social network in order to maximally influence the population. We assume individuals in a directed social network each have a latent opinion that evolves over time in response to social media posts by their neighbors. The individuals randomly communicate noisy versions of their latent opinion to their neighbors, causing them to update their opinions using a time-varying update rule that has them become more stubborn with time and be less affected by new posts. The noisy communicated opinion and dynamic update rule are novel components of our model and they reflect realistic behaviors observed in many psychological studies.

We prove that under fairly general conditions, the opinions converge to an equilibrium in the presence of stubborn agents. What is surprising about this result is that the equilibrium condition depends only upon the network structure and the identity of the stubborn agents. The time-varying opinion update rules, which are heterogeneous across individuals, do not affect the equilibrium. We also prove bounds on the rate of convergence to this equilibrium.

We then use this model to develop a discrete optimization formulation for the problem of maximally shifting the equilibrium opinions in a network by targeting users with stubborn agents. We consider maximizing the mean opinion and also maximizing the number of individuals whose opinion exceeds a fixed threshold. We show that the mean opinion is a monotone submodular function, allowing us to find a good solution using a greedy algorithm. We find that on real social networks in Twitter consisting of tens of thousands of individuals, a small number of stubborn agents can non-trivially influence the equilibrium opinions. Furthermore, we show that our greedy algorithm outperforms several common benchmarks.

Operations Research

Finding Extremists in Online Social Networks

Extremist groups such as ISIS use social media to recruit, spread propaganda, and incite violence. How can one monitor social networks to shut down these users? Our work develops simple machine learning methods to predict if new users will belong to existing extremist groups using their network neighborhoods. We also develop a network search model based on Polya’s Urns that lets one monitor the network to see if suspended users return with new accounts. We tested our method on real Twitter accounts of ISIS members and found we can find the returning users much faster than other search methods.This work is done with Lieutenant Colonel Christopher Marks.

OPERATIONS RESEARCH

Finding Rumor Sources on Random Trees

Imagine a rumor spreads in a network and all we know is the source. Then can we find the source using only the network structure? It turns out that we can and there is a network centrality we created called rumor centrality which does this. For regular tree networks under a certain rumor spreading model it is an exact maximum likelihood estimator for the source. We have proven that it performs well on essentially any sparse network, showing the rumor centrality is a universally good source estimator. Rumor centrality is also a rather interesting mathematical object, having connections with linear extensions of partial orders, random walks, Markov chains, and Nash equilibria of certain network games.Related publications can be found here.This work was done in collaboration with Devavrat Shah.The figure on the right shows how rumor centrality finds the source. A simulated rumor was spread on this network, and the each node’s normalized rumor centrality was calculated, with red=1 and blue = 0. As can be seen, rumor centrality narrows down the set of likely rumor sources. In this figure, the true source is the red node in the center.

Working paper

Building a Location-Based Set of Social Media Users

We present an algorithm to build a set of social media users starting from a small seed set. Our method uses an expand-classify approach where we classify the set of users as in the target location or not, and then expands the set by collecting the neighbors of the people in the target location. Our classification algorithm uses a factor graph model and we show that classification can be easily done by solving a min-cut problem. We find that we can collect several thousand users from Twitter in a few hours. Testing on geo-tagged content, we find that our method is fairly accurate.

ANNALS OF APPLIED STATISTICS

A Bayesian Approach for Predicting the Popularity of Tweets

We developed a probabilistic model for the spread of an individual tweet in Twitter. Our key observation is that the retweet times follow a log-normal distribution. We use this in a hierarchical Bayesian model to forecast the total number of retweets a tweet receives. We are able to achieve errors below 30% using only 5 minutes of observed retweets.

Working paper

Picking Winners: A Data Driven Approach to Evaluating the Quality of Startup Companies

We develop a data-driven approach to build portfolios of startup companies. Our framework involves building portfolios to maximize the probability of at least one company “winning” which means having an IPO or an acquisition. We call this the “Picking Winners” portfolio. We develop a model based on first passage times random walks for the companies’ funding round time series. We train our model on data for 24,000 companies. We find with our approach we can achieve higher exit rates than many top venture capital firms.

Media

The Wallstreet Journal

Data-Driven Portfolios Power ‘Home-Run’ Exits in MIT StudyWorking paper

Picking Winners in Daily Fantasy Sports Using Integer Programming

We develop an integer programming formulation for picking entries into daily fantasy sports contests on sites like DraftKings. Our “Picking Winners” portfolio of entries aims to have one entry achieve a huge number of points and win the contest. This is because nearly all the money goes to the winner. Our algorithm creates lineups with high variance (known as stacking) and then forces the lineups to not share too many entries (diversify the portfolio). We were able to win multiple contests in hockey and baseball using our approach.

Working paper

Leaders, Followers, and Community Detection

What defines a community in a social network? Is it the community’s leaders? Or is it the less important followers in a community? We have found that these followers actually provide the identifiable signal for a community in a network. These followers have no friends outside of their community, and in this way provide the community with an identity. Using this observation, we created the leader-follower algorithm (LFA) which finds communities in networks by searching for these followers. The LFA has linear run-time and so can be applied to incredibly large networks. It can find overlapping communities and learns the number of communities naturally from the network structure. We have also proven that the LFA has good performance on a wide class of networks. These strengths give the LFA and advantage over other community detection methods such as spectral clustering or inference based methods. The figure on the right shows the communities found by the LFA in my own Facebook network, with appropriate social labels for each community found.

Working paper

Predicting Performance Under Stress Using Electrodermal Activity

We show that you can predict who will choke under pressure using biometric data measured in low stress environments. We conduct an experiment where subjects perform arithmetic questions for money. They answer the questions in low stress and high stress rounds. High stress is induced by time constraints and higher monetary reward. We measure their electrodermal activity (EDA), which is a measure of how much they sweat. EDA is a known physiological indicator of mental arousal. Basically, if your mind is aroused, you sweat. The arousal can be due to stress or excitement. We find that subjects whose EDA decreases in the low stress round do poorly in the high stress round. Unlike other studies in this area, ours is truly predictive, in that we forecast a future outcome using historical biometric data. Our results suggest the potential of biometric data for job training and hiring processes.

Working paper

Learning User Preferences and Engagement Using Choice and Time Data

What can how fast you swipe on a face in Tinder tell us about how much you like (or dislike) that face? Turns out quite a bit. We develop a “choice-time-engagement” model that uses choice and time data to infer how much you like something, and how engaged you are with the decision you make. Think about apps like Tinder. A lot of times, people aren’t engaged and swipe blindly. It would be nice to measure this engagement, and maybe count their swipes less when trying to figure out the quality of users on the site. Our choice-time-engagement model can denoise online survey data using choice and time data to get better estimates of true user preferences. This is useful for not only dating apps, but nearly any app where people vote on stuff.To demonstrate that our model works, we tested it on some online surveys. One is a Mathematicians survey, where the subjects vote on their preferred mathematicians in head to head comparisons. The resulting response times and preferences are shown in the figure on the left. As can be seen, Newton vs. Pascal is a fast decision, with Newton winning most of the time. On the other hand, Newton vs Gauss takes 4.3 sec to decide, and its nearly 50/50.