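Academic Seminar
Tongji University, Department of Computer Science and Technology, Zhixin Forum (智信讲坛), Session 99
Title: Data Quality and Data Mining with Crowdsourcing
Speaker: Prof. Victor S. Sheng (盛胜利), Department of Computer Science, University of Central Arkansas
Time: Tuesday, June 20, 2017, 14:00
Venue: Room 403, Dianxin Building (电信大楼)
Organizer: Department of Computer Science and Technology
Host: Prof. Yang Kai (杨恺)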
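Speaker bio: Victor S. Sheng (盛胜利) is an associate professor in the Department of Computer Science at the University of Central Arkansas, USA, and director of its Data Analytics Lab. He received a master's degree from Suzhou University in July 1999, a master's degree from the University of New Brunswick, Canada, in December 2003, and a Ph.D. from the University of Western Ontario, Canada, in August 2007. From September 2007 to August 2009 he was a postdoctoral researcher at the Stern School of Business, New York University, USA. His research interests are data mining and machine learning, artificial intelligence, data security, and decision support, together with their applications in business, bioinformatics, medical informatics, software engineering, and related fields. He has published more than 100 papers in international journals and conferences. The journals include TPAMI, TKDE, JMLR, TMM, TNNLS, and DMKD; the conferences include IJCAI, KDD, ICML, AAAI, ECML, ICDM, DASFAA, ACM MM, ICMR, ICME, and CIKM. More than 20 of these papers appeared in CCF-recommended Class A journals and conferences, and his most-cited paper has received more than 680 citations. He currently serves as Finance Chair of ICDM 2017 and on the editorial boards of several international journals. His honors include: Best Student Paper Award finalist at WISE (2015); Best Paper Award at ICDM (2011); Best Paper Award runner-up at KDD (2008); Google Student Award at a machine learning symposium (2008); and Best Poster Award at the IEEE Kitchener-Waterloo Section Joint Workshop on Knowledge and Data Mining (2006).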
Abstract: Crowdsourcing systems provide convenient platforms for collecting human intelligence on a variety of tasks (e.g., labeling objects) from a vast pool of independent workers (a crowd). We first present repeated-labeling strategies of increasing complexity for obtaining multiple labels. Repeatedly labeling a carefully chosen set of points is generally preferable to labeling points indiscriminately, and we recommend a robust technique that combines different notions of uncertainty to select the data points that should receive more labels. Recent research on crowdsourcing focuses on deriving an integrated label from multiple noisy labels via expectation-maximization-based (EM-based) ground truth inference. We present a novel framework that introduces noise correction techniques to further improve the quality of the integrated labels obtained after ground truth inference. We further show that biased labeling is a systematic tendency. State-of-the-art ground truth inference algorithms do not handle biased labeling well, whereas our simple consensus algorithm performs much better. Finally, we present pairwise solutions for maximizing the utility of multiple noisy labels for learning. Pairwise solutions completely avoid the potential bias introduced by ground truth inference. Because they take both sides into account (the potentially correct information and the potentially incorrect, noisy information), they perform very well whether few or many labels are available.
All faculty and students are warmly welcome to attend!