Protecting Privacy by Sharing Secrets
Using machine learning and a cryptographic technique called 'secret sharing,' UW Tacoma researchers have developed a way to screen social media content for hate speech while protecting the privacy of users.
You unlock your smartphone, open a browser and search the internet for something, say a pair of shoes, a new blender, concert tickets or even textbooks. Later, you’re on a social media site or reading an article when you notice an advertisement for the thing you were looking at earlier. The idea that data follows us around online has become commonplace, the price of admission for access to the global marketplace of goods and information. However, the practice raises concerns about what companies are doing with your data and how they got it in the first place.
Enter UW Tacoma Professor Martine De Cock and Associate Professor Anderson Nascimento. “We are looking to address problems that are socially relevant, where we can get the benefits of machine learning and artificial intelligence, and do that in a way that doesn’t violate a person’s privacy,” said Nascimento.
One aspect of De Cock and Nascimento’s larger research involves using machine learning to detect misogynistic tweets or those that contain hate speech directed at immigrants. “There is a lot of this material that appears in tweets or on Facebook and it’s becoming a real issue,” said De Cock. “It’s very difficult for social media companies to do proper moderation, simply because of the sheer amount of content that is published every day.”
As part of their work, De Cock and Nascimento have created their own techniques for identifying tweets that contain misogynistic language or hate speech. They’ve also examined other approaches to addressing this issue. “What we’ve been able to show is that when you have limited labeled data, instead of going for this powerful, data-hungry technique called deep learning, you can use other techniques and then you can enhance these other techniques with expert knowledge,” said De Cock.
Before going forward in this story, it’s important to take a step back. Machine learning is one branch within the larger tree of artificial intelligence. De Cock and Nascimento work in natural language processing, a field that applies machine learning to human language. “Essentially, we’re interested in using artificial intelligence techniques that try to grasp meaning or categorize natural language into some specific, preassigned categories—this tweet is misogynistic or that one is offensive toward immigrants,” said Nascimento.
The process starts with tweets that have been marked by human annotators as misogynistic or hateful toward immigrants. “These examples are given to the computer as training examples to learn from, in a process called supervised learning,” said De Cock. “The level of additional guidance we give to the computer during the training process varies widely. You could do something called deep learning – which is very powerful – where you give very few hints and expect that the program will figure it out.”
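To make that training step concrete, here is a minimal sketch of supervised text classification using the common scikit-learn library. The tweets, labels and model choice are placeholder assumptions for illustration, not the researchers' actual data or system.

```python
# Minimal sketch of supervised learning on labeled tweets
# (illustrative placeholders only, not the researchers' pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for human-annotated tweets: 1 = flagged, 0 = benign.
tweets = [
    "placeholder for an abusive post",
    "placeholder for another abusive post",
    "what a lovely day at the park",
    "just finished reading a great book",
]
labels = [1, 1, 0, 0]

# Turn each tweet into a TF-IDF vector, then fit a linear classifier
# on the labeled examples -- this is the supervised learning step.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)

# The trained model can now score tweets it has never seen.
print(model.predict(["placeholder for a new post"]))
```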
What De Cock and Nascimento found is that when less labeled data is available to learn from — in this case 10,000 tweets — it’s possible to guide the system with expert knowledge and still perform well. “This hybrid system that combines machine learning with data supplemented by people performed the best,” said Nascimento.
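One plausible reading of that hybrid approach, sketched below under assumptions: alongside the learned text features, add a hand-built feature that counts terms from an expert-curated word list. The lexicon contents and the transformer class here are invented placeholders, not the researchers' actual expert knowledge.

```python
# Hedged sketch: combining learned TF-IDF features with a small
# expert-curated lexicon feature (placeholder terms, not real data).
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline

EXPERT_LEXICON = {"slur1", "slur2"}  # stand-ins for expert-flagged terms

class LexiconFeature(BaseEstimator, TransformerMixin):
    """One extra feature: how many expert-flagged terms a text contains."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        counts = [sum(word in EXPERT_LEXICON for word in text.lower().split())
                  for text in X]
        return np.array(counts, dtype=float).reshape(-1, 1)

# The learned features and the expert feature are concatenated side by side.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("lexicon", LexiconFeature()),
])
model = make_pipeline(features, LogisticRegression())

tweets = ["slur1 you are awful", "have a nice day",
          "slur2 nonsense", "see you tomorrow"]
labels = [1, 0, 1, 0]
model.fit(tweets, labels)
print(model.predict(["slur1 again", "lovely weather"]))
```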
Here is where De Cock and Nascimento are working to break new ground. “Currently, these machine-learning algorithms have to see the data in order to come up with a classification; they have to see the tweet to say whether it’s misogynistic or not,” said Nascimento. “We are developing new techniques that will allow us to process text in a privacy-preserving way.”
The work involves breaking a message down into pieces, or “secret shares,” none of which reveals anything about the original on its own. Each share is sent to a different computer, and the computers exchange encrypted computations based on those shares. “The final result of the computations is only revealed when, in the end, all the resulting encrypted pieces are purposefully put together,” said De Cock.
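For a concrete sense of the primitive, here is a minimal sketch of additive secret sharing, a textbook construction; the researchers' actual privacy-preserving classifiers layer far more protocol machinery on top of it, and the modulus below is an arbitrary assumed choice.

```python
# Minimal sketch of additive secret sharing (textbook construction).
import secrets

P = 2**61 - 1  # a public prime modulus (assumed choice)

def share(value, n_parties=3):
    """Split `value` into n random-looking shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Only putting all the pieces together reveals the secret."""
    return sum(shares) % P

secret = 42
pieces = share(secret)
print(pieces)               # each piece alone is uniformly random
print(reconstruct(pieces))  # 42

# The sharing is additively homomorphic: computers holding shares of two
# secrets can add them share-by-share without ever seeing either secret.
a, b = share(10), share(32)
summed = [(x + y) % P for x, y in zip(a, b)]
print(reconstruct(summed))  # 52
```

Because no computer ever holds more than its own share, a classifier evaluated this way can, in principle, score a tweet without any single machine seeing the tweet itself.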
De Cock and Nascimento and a team of computer science and systems students at UW Tacoma are pressing ahead with the idea. “We actually have published work where, given text and a photo, we can infer the personality, age and gender of a person without actually seeing the information,” said De Cock.
De Cock and Nascimento see a broad range of potential applications. “What if someone could find out if they’re at risk for diabetes or Alzheimer’s but the entity doing the analysis never actually saw their data in the first place,” said Nascimento. “We already have a patent with a startup in Seattle where they apply privacy-preserving mechanisms to health care data.”
In the future, when you unlock your smartphone and open a browser, the things you searched for may indeed still follow you around. However, if the techniques being developed by De Cock and Nascimento come to fruition, the companies using the data will not know who you are. “It’s kind of a win-win scenario,” said De Cock.