A Weekend of Data Transparency

This past weekend, I had the privilege of traveling to New York City for The Wall Street Journal’s Data Transparency Weekend. This was a codeathon whose theme was the development of tools to allow people — ranging from journalists and technical professionals to ordinary users — to better understand how their data are collected and used in the course of using the Internet.

Data Transparency Weekend was the brainchild of the team, led by Julia Angwin, that produces the WSJ’s popular What They Know series. They managed to bring together around 100 of us to see what we could do to advance the state of the art in three general areas: Scanning, Education, and Control. The weekend began on Friday evening with a dinner gathering, followed by some short talks to inspire us. Those of us with project ideas then gave 30-second pitches to form our project teams. I proposed an idea around investigating the possible use of HTML5 local storage for tracking purposes, which in retrospect was a little too narrow and didn’t attract much interest. But Ed Felten, who was organizing another project, tapped me on the shoulder and suggested that my idea had some synergy with his, so I joined his project team.

For those of you who don’t know Ed Felten, he’s a highly respected professor at Princeton who is currently serving as Chief Technologist at the Federal Trade Commission. It was amazing to work with/for Ed, not so much because of his stature and celebrity but because he’s just a great project organizer and leader. Our project was called the Tracking Report Card, an effort to summarize and display for users the extent to which they’re tracked by third parties (such as advertisers) on frequently-used websites, and in particular by those that don’t honor do-not-track and other opt-out measures. By the end of the evening Friday, we had a general block diagram of the project and tasks assigned to each of us. My main task was to write an engine to recognize which of the browser cookies placed by a website are used for tracking, as opposed to cookies that don’t identify the user or that serve other purposes, such as storing tracking opt-out preferences.
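To give a flavor of the task, here is a minimal sketch of the kind of heuristic such an engine might use. This is illustrative only, not the actual Tracking Report Card code: it assumes cookies can be triaged by value length and character entropy, and the marker values and thresholds are hypothetical.

```python
# Hypothetical sketch of a cookie-classification heuristic; the real
# engine from the project may have worked quite differently.
import math
from collections import Counter

def shannon_entropy(value: str) -> float:
    """Bits of entropy per character, estimated from character frequencies."""
    if not value:
        return 0.0
    counts = Counter(value)
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def classify_cookie(name: str, value: str) -> str:
    """Rough triage of a cookie into opt-out, non-identifying, or likely tracker."""
    # Opt-out cookies typically carry a fixed, well-known marker value.
    if value.upper() in {"OPT_OUT", "OPTOUT", "NO"}:
        return "opt-out"
    # Short or low-entropy values can't uniquely identify a user.
    if len(value) < 8 or shannon_entropy(value) < 3.0:
        return "non-identifying"
    # Long, high-entropy values look like unique user identifiers.
    return "likely tracking ID"

print(classify_cookie("id", "OPT_OUT"))                    # opt-out
print(classify_cookie("lang", "en-US"))                    # non-identifying
print(classify_cookie("uid", "a93bf02c7de14f6690ab33c1"))  # likely tracking ID
```

In practice a real classifier would also look at cookie lifetime, the domain setting the cookie, and whether the value changes between fresh browser profiles, but the triage structure above captures the basic idea.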

The project team included people with a variety of skills and backgrounds, which was just what we needed. We had members who knew how to collect the data, people who knew how to present it through a Firefox extension, others who knew how to process it, and still others who were familiar with advertising practices and common cookie usage.

I had been a bit concerned about how effective I would be at writing code, since it had been a while since I had written very much. I do wish that my coding skills were fresher, but I managed to crank out the code needed for my task. And, as I found out at the one hackathon I had attended previously, writing code isn’t the only thing that happens. There is infrastructure to set up, documentation to write, and graphics to prepare. So there’s something for everyone with even peripherally relevant skills.

Part of the idea at a codeathon is to produce something, even if it isn’t perfect, by a specified deadline. Our project is described here, and you can download the proof-of-concept Firefox extension to try it for yourself.

I had a fabulous time at Data Transparency Weekend. I got to talk with some amazing people, connect with some that I had only conversed with via Twitter, meet many new people working in the field, pick up some inspiration and motivation, hone my coding skills, and learn a whole lot. I made some new connections for OneID as well. Not bad for a weekend.

Some take-aways:

  • The culture, at this event anyway, was not intimidating. People were available to help, and always did so in a supportive manner.
  • If there was one thing that I wish I had learned better before the event, it was the use of GitHub, the popular source code management and collaboration system. I ended up asking for more help with that than with anything else.
  • We should probably have started a little earlier on our project writeup and website. A common mistake.

My thanks to the Wall Street Journal folks and other organizers who provided wonderful facilities and support, kept us very well fed throughout the event, and basically took care of any annoyance that might distract from our productivity and enjoyment of the event. Thanks to the other members of my team as well. I wasn’t ready for the weekend to end (maybe this means I got more sleep than I’m supposed to). I’m hoping they had a great experience as well, and that they’ll sponsor more of these.

More info on the event:

Passwords Are Bad, But Security Questions Are Worse

Everyone, by now, has run into those “security questions” – sets of questions you need to answer to set up an online account (or sometimes to continue using an existing one). They are supposed to identify you in the event that you forget your password or (less frequently) need to be contacted to confirm that some online activity isn’t fraudulent. The name “security questions” tends to imply that they improve the security of your account, but much of the time the opposite is true. This also points to another problem with the username/password infrastructure we have, and more generally with the use of shared secrets for authentication.

When you are asked to define and answer security questions, it isn’t always clear how they will be used. The most common usage is to reset your password, as shown in the attached illustration. The motivation for this type of security question is all about minimizing support costs: if the password recovery process can be automated, it reduces the number of customer service people needed and saves money. But the security questions, and their answers, become a second, far less secure, password. It doesn’t do much good to require upper- and lowercase characters, digits, and special characters in a password if an attacker can simply claim to have forgotten the password and then guess something much easier, like your pet’s name.

Another way that security questions are used is to verify who you are when a financial institution contacts you about potentially fraudulent activity in your account. It’s important to remember that most of the authentication here comes from the telephone network: it’s likely to be you since they called your phone number, but a little extra assurance might be nice. Of course, you don’t have any real assurance that it really is the bank calling; it could be someone trying to extract the answer to a password-recovery security question.

Security questions embody several of the worst practices with respect to shared secrets:

  • Easy to derive from available sources, like social networks and genealogy sites
  • Low cryptographic entropy (easy dictionary attack)
  • Shared among many sites, since many sites use the same questions
  • Difficult to change if answers are compromised
  • Likely to be accessible to the relying party’s customer service staff in the clear
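To make the low-entropy point concrete, here is a rough back-of-envelope comparison of guess-space sizes. The sizes are illustrative assumptions, not measurements; real answer distributions are even more skewed toward a few popular choices.

```python
import math

# Illustrative guess-space sizes (assumed for the sake of the estimate).
answer_spaces = {
    "8-char random password (94 printable chars)": 94 ** 8,
    "pet name (top-1000 name list)": 1000,
    "sibling's birth month and year (~16-year window)": 12 * 16,
    "city where you met your spouse (few plausible cities)": 10,
}

for secret, size in answer_spaces.items():
    # log2 of the guess space gives the effective entropy in bits.
    print(f"{secret}: ~{math.log2(size):.1f} bits")
```

The random password comes out around 52 bits; every security-question answer lands between roughly 3 and 10 bits, i.e. a few dozen to a few thousand guesses.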

Let’s look at a typical set of security questions.

  1. “In what city did you meet your spouse?” There’s a good likelihood that it’s either the city you live in, one you lived in previously, or somewhere close to one of these. Access to a Facebook or LinkedIn profile narrows the likely possibilities greatly.
  2. “What was your childhood nickname?” Not everybody has a unique nickname, and those that do may have childhood friends posting to their Facebook wall with that name.
  3. “What is the name of your favorite childhood friend?” Even if the friend is not currently a Facebook friend, the likely choices are fairly limited.
  4. “What is your oldest sibling’s birthday month and year?” Even without access to data about the sibling, they’re likely to have been born within a few years of you, and there are only 12 months, for only about 200 possibilities.
  5. “What is the middle name of your youngest child?” Many people give their children family-related middle names, in which case this is very easily guessed.
  6. “What is your oldest sibling’s middle name?” Same comment about use of family names as middle names.
  7. “What school did you attend for sixth grade?” If you grew up in a small town as I did, and a social networking profile indicates where that is, the answer is known.
  8. “What was the name of your first stuffed animal?” I had to laugh at this one; it seems to relate to a different definition of “security” than I thought we were discussing.
  9. “In what city did your mother and father meet?” Genealogy sites might have this information.
  10. “What was the name of your third grade teacher?” Again, one of the better questions, at least for me. For a younger person, there is some likelihood that the same teacher is still teaching third grade.
  11. “What was your maternal grandmother’s maiden name?” Often can be obtained from genealogy sites.
  12. “In what city was your first job?” If you know where someone grew up, that’s likely to be where their first job was as well.
  13. “In what city were you born?” This information can be obtained from lots of places, including social networking, genealogy, and newspaper archives.
  14. “What was the name of your first pet?” If you need help guessing, there are resources for popular pet names.

Some sites let people set their own questions, which might seem like a better approach. But user security awareness is so poor that a colleague reports having had to ban questions of the form, “My password is XXX?”

In summary, “security questions” usually degrade, rather than contribute to, security. This is yet another of the problems with passwords, and with shared secrets more generally, that OneID addresses.

[Hat-tip to Avivah Litan of Gartner whose blog post got me thinking more about this and prompted me to dredge up some earlier work I had done on this topic.]