There have been a number of high-profile criminal cases that were solved using the DNA that family members of the accused placed in public databases. One lesson there is that our privacy isn’t entirely under our control; by sharing DNA with you, your family has the ability to choose what everybody else knows about you.
Now, some researchers have demonstrated that something similar is true about our words. Using a database of past tweets, they were able to effectively pick out the next words a user was likely to use. But they were able to do so more effectively if they simply had access to what a person’s contacts were saying on Twitter.
Entropy is inescapable
The work was done by three researchers at the University of Vermont: James Bagrow, Xipei Liu, and Lewis Mitchell. It centers on three different concepts relating to the informational content of messages on Twitter. The first is the concept of entropy, which in this context describes how many bits are, on average, needed to describe the uncertainty about future word choices. One way of looking at this is that, if you’re certain the next word will be chosen at random from a list of 16, then the entropy will be four bits (2^4 is 16). The average social media user has a 5,000-word vocabulary, so choosing at random from among that would be an entropy of a bit more than 12. They also considered the perplexity, which is two raised to the power of the entropy: 16 in the example we just used, where the entropy is four.
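The arithmetic linking entropy and perplexity can be checked in a few lines. This is only a sketch for the simplest case described above, where every word in the list is equally likely; the function names are made up for illustration and are not taken from the paper.

```python
import math


def entropy_uniform(n_choices: int) -> float:
    """Entropy in bits when the next word is equally likely to be any of n_choices."""
    return math.log2(n_choices)


def perplexity(entropy_bits: float) -> float:
    """Perplexity: the size of an equally-likely word list with this entropy."""
    return 2.0 ** entropy_bits


if __name__ == "__main__":
    print(entropy_uniform(16))    # a list of 16 words: four bits
    print(entropy_uniform(5000))  # a 5,000-word vocabulary: a bit more than 12 bits
    print(perplexity(4.0))        # back from four bits to a list of 16
```

Real word choices are not uniform, so a measured entropy is generally lower than this ceiling; the perplexity simply translates that entropy back into an equivalent list size.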
The final concept they used is called predictability, which is simply the probability of accurately predicting the next word used.
To see how these concepts worked in the world of social media, the researchers turned to a database of about 14,000 Twitter users who collectively produced more than 30 million tweets. Within this, they identified 927 users and the 15 users each of them most frequently interacted with. The history of those interactions was fed into an algorithm that measured the predictability of future word use, given what had happened in the past.
In general, people were fairly predictable. For most of these 927 users, the entropy fell between 5.5 and eight bits, meaning the next word is typically found in a list of between 45 and 256 words. Then, the researchers repeated the analysis using only the tweets of the user each person most frequently interacted with. This cross-user entropy typically fell in the range of six to 12 bits. The high end of that range is roughly equivalent to picking random words, but the low end is well below random, corresponding to the word being found within a list of 64. Put differently, a user’s own history gave a predictability of 40-70 percent, while their closest contact’s history provided a predictability of zero to 60 percent.
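The article does not say how entropy was converted into a predictability percentage. One standard way to do it is Fano's inequality, which gives the highest prediction accuracy compatible with a given entropy and vocabulary size. The sketch below assumes that approach and the 5,000-word vocabulary mentioned earlier; the function names are hypothetical, and the authors' actual procedure may differ.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no outcome with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_entropy(p: float, vocab: int) -> float:
    """Largest entropy (bits) consistent with predicting correctly with probability p."""
    return binary_entropy(p) + (1.0 - p) * math.log2(vocab - 1)


def max_predictability(entropy_bits: float, vocab: int = 5000, tol: float = 1e-9) -> float:
    """Solve fano_entropy(p, vocab) == entropy_bits for p by bisection.

    Assumption: a 5,000-word vocabulary, as quoted in the article.
    fano_entropy falls steadily as p rises from 1/vocab to 1, so bisection works.
    """
    lo, hi = 1.0 / vocab, 1.0
    if entropy_bits >= math.log2(vocab):
        return lo  # no better than guessing at random
    if entropy_bits <= 0.0:
        return hi  # no uncertainty at all
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fano_entropy(mid, vocab) > entropy_bits:
            lo = mid  # mid leaves too much entropy, so accuracy can be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for bits in (5.5, 8.0):
        print(bits, round(max_predictability(bits), 3))
```

Under these assumptions, 5.5 and eight bits come out at roughly 63 and 43 percent, which is in the same neighborhood as the 40-70 percent range quoted above, though not an exact match.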
So predictable
But most users interact with a variety of people online, and it could be that some interactions are more relevant than others. So, the authors continued to add interacting users and found that each one improved the predictability (or, put differently, lowered the entropy). By the ninth interacting user, the entropy was actually lower than it was when it was generated using the user’s own words. In other words, knowing what your friends have said made you more predictable than knowing what you’d said. The drop in entropy continued up to the 15-user limit they’d set for the work.
That’s not to say your friends know you better than you know yourself. Instead, combining a user’s own history with that of their contacts boosts the predictability even more.
The authors figured that some of this might be a product of language structure. So they mixed up interacting users, linking them to people they hadn’t interacted with. This cut the predictability dramatically, indicating language wasn’t everything. In a similar way, they brought in unrelated tweets that were made at the same time to confirm that the predictability wasn’t simply a product of people talking about topical subjects that were trending at the time.
The authors next analyzed whether a user’s behavior on Twitter predicted much about how predictable they were. People who posted regularly—eight or more tweets a day—tended to be more predictable. In addition, their connected users who were active at a similar level didn’t contribute much to predictions, as they were often tweeting about unrelated things. And a stronger social tie (as measured by how many connections users had) tended to mean a stronger contribution to predictability.
If a connected user frequently initiated contact with the key user, then that connection enhanced predictability. But if the central user was the one making contact, then it didn’t. This suggests part of the key to predictability may be that a given tweet comes in response to some prompt from a connection.
You can never leave
This has some obvious implications for privacy. If a person leaves a social network, but their history remains (as is the case with Twitter, the one analyzed here), then it should be possible to reconstruct their social network and analyze it to get some understanding of the person who has tried to become more anonymous. In addition, if you can reconstruct a person’s offline relationships and find them on social media, then it’s possible you could learn something about a person who has never joined the service. As the authors of the paper describe it, “If an individual forgoes using a social media platform or deletes their account, yet their social ties remain, then that platform owner potentially still possesses 95.1 ± 3.36% of the achievable predictive accuracy of the future activities of that individual.”
The companies that offer these social media services are obviously in a better position to analyze these networks. So, for example, Facebook could potentially infer the existence of a sibling that has never joined and then build up a profile of what that person’s posts were likely to sound like.
But there are definitely limits here. This doesn’t indicate that we can predict much about a person other than their more probable social media posts, and more specifically their responses to the social media posts of their connections. That’s pretty far from Minority Report-like predictability. But, given that everyone from marketers to Russian intelligence agencies seems to be interested in figuring out users’ social media proclivities, the finding that you don’t even have to be on social media for them to draw inferences about you isn’t especially comforting.