Yuhong Nan, "Semantics-Driven, Learning-Based Privacy Discovery in Mobile Apps"

CERIAS Weekly Security Seminar - Purdue University

Summary: A long-standing challenge in analyzing information leaks within mobile apps is to automatically identify the code that operates on sensitive data. Because existing solutions all rely on system APIs (e.g., for the IMEI or GPS location) or on features of user interfaces (UI), content from app servers, such as a user’s Facebook profile or payment history, falls through the cracks. In this talk, I will introduce ClueFinder, a novel semantics-driven solution for the automatic discovery of sensitive user data, including data from the server side. ClueFinder uses natural language processing (NLP) to automatically locate the program elements (variables, methods, etc.) of interest, and then performs a learning-based program structure analysis to accurately identify those that actually carry sensitive content. Using this new technique, we analyzed over 400k popular apps, an unprecedented scale for this type of research. Our findings bring to light the pervasiveness of information leaks and the channels through which they happen, including unintentional over-sharing across libraries and aggressive data acquisition behaviors.
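
To make the first stage more concrete, here is a minimal sketch of locating program elements by the words in their names. It is not ClueFinder's implementation: plain keyword matching stands in for the NLP step, and the keyword list, function names, and sample identifiers are all hypothetical. The second stage, the learning-based structure analysis that filters out false matches, is not shown.

```python
import re

# Hypothetical keyword list; the abstract does not specify ClueFinder's vocabulary.
SENSITIVE_KEYWORDS = {"profile", "payment", "password", "email", "location", "phone"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase word tokens."""
    # Insert a space at each lowercase/digit-to-uppercase boundary (camelCase).
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    # Split on any run of non-alphanumeric characters (spaces, underscores, ...).
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok]


def find_candidates(identifiers):
    """Map each identifier that contains a sensitive keyword to its matched keywords."""
    hits = {}
    for ident in identifiers:
        matched = SENSITIVE_KEYWORDS.intersection(split_identifier(ident))
        if matched:
            hits[ident] = sorted(matched)
    return hits


if __name__ == "__main__":
    # Hypothetical identifiers, as they might appear in decompiled app code.
    names = ["getUserProfile", "payment_history", "mTextView", "userEmailAddr"]
    for ident, words in find_candidates(names).items():
        print(ident, "->", words)
```

A name match alone only yields candidates: an identifier such as `profileIconView` contains a keyword without holding user data, which is why the abstract pairs this step with a structure analysis to identify the elements that really carry sensitive content.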