CIKM 2022

Tutorials

The CIKM 2022 tutorial program will host 7 compelling tutorials that highlight the breadth of interesting problems being explored in the fields of Information and Knowledge Management.

#
Title
Contact
Website
1
Information extraction from social media: A hands-on tutorial on tasks, data, and open source tools
Shubhanshu Mishra
Email
Full-Day
Organizers
Shubhanshu Mishra (Twitter Inc.), Rezvaneh Rezapour (Drexel University) and Jana Diesner (University of Illinois at Urbana-Champaign).
Abstract

Information extraction (IE) is a common sub-area of natural language processing that focuses on identifying structured data from unstructured data. The Information Retrieval (IR) community relies on accurate and high-performance IE to retrieve high-quality results from massive datasets. One example of IE is to identify named entities in a text, e.g., "Katy Perry lives in the USA". Here, Katy Perry and USA are named entities of types PERSON and LOCATION, respectively. Another example is to identify the sentiment expressed in a text, e.g., "This movie was awesome". Here, the sentiment expressed is positive. A third example is to identify various linguistic aspects of a text, e.g., part-of-speech tags, noun phrases, and dependency parses, which can serve as features for additional IE tasks. This tutorial introduces participants to a) the usage of Python-based, open-source tools that support IE from social media data (mainly Twitter), and b) best practices for ensuring the reproducibility of research. Participants will learn and practice various semantic and syntactic IE techniques that are commonly used for analyzing tweets. Additionally, participants will be familiarized with the landscape of publicly available social media data (including popular NLP and IE benchmarks) and methods for collecting and preparing them for analysis. Finally, participants will be trained to use a suite of open-source tools (SAIL for active learning, TwitterNER for named entity recognition, TweetNLP for transformer-based NLP, and SocialMediaIE for multi-task learning), which utilize advanced machine learning techniques (e.g., deep learning, active learning with human-in-the-loop, multi-lingual, and multi-task learning) to perform IE on their own or existing datasets. Participants will also learn how social context can be integrated into IE systems to improve them, and how time affects the quality of social media IE.
The tools introduced in the tutorial will focus on the three main stages of IE, namely, collection of data (including annotation), data processing and analytics, and visualization of the extracted information. More details can be found at: https://socialmediaie.github.io/tutorials/.

2
Graph-based Management and Mining of Blockchain Data
Arijit Khan
Email
Half-Day
Organizers
Arijit Khan (Aalborg University) and Cuneyt Gurcan Akcora (University of Manitoba)
Abstract

The mainstream adoption of blockchains has led to the development of many decentralized applications and web platforms, including Web 3.0, a peer-to-peer internet with no single authority. The data stored in a blockchain can be considered big data: massive in volume, dynamic, and heterogeneous. Due to its highly connected structure, graph-based modeling is an optimal tool to analyze the data stored in blockchains. Recently, several research works have performed graph analysis on publicly available blockchain data to reveal insights into its business transactions and to support critical downstream tasks, e.g., cryptocurrency price prediction and the detection of phishing scams and counterfeit tokens. In this tutorial, we discuss relevant literature on blockchain data structures, storage, categories, data extraction and graph construction, graph mining, topological data analysis, the machine learning methods used, target applications, and the new insights revealed by them, aiming to provide a clear view of unified graph-data models for UTXO and account-based blockchains. We also emphasize future research directions.

3
Fairness of Machine Learning in Search Engines
Yi Fang
Email
Half-Day
Organizers
Yi Fang (Santa Clara University), Hongfu Liu (Brandeis University), Zhiqiang Tao (Santa Clara University) and Mikhail Yurochkin (IBM Research and MIT-IBM Watson AI Lab)
Abstract

Fairness has gained increasing importance in a variety of AI and machine learning contexts. As one of the most ubiquitous applications of machine learning, search engines mediate much of the information experiences of members of society. Consequently, understanding and mitigating potential algorithmic unfairness in search have become crucial for both users and systems. In this tutorial, we will introduce the fundamentals of fairness in machine learning, for both supervised learning such as classification and ranking, and unsupervised learning such as clustering. We will then present the existing work on fairness in search engines, including the fairness definitions, evaluation metrics, and taxonomies of methodologies. This tutorial will help orient information retrieval researchers to algorithmic fairness, provide an introduction to the growing literature on this topic, and gather researchers and practitioners interested in this research direction.

4
Mining of Real-world Hypergraphs: Patterns, Tools, and Generators
Geon Lee
Email
Half-Day
Organizers
Geon Lee (Korea Advanced Institute of Science and Technology), Jaemin Yoo (Carnegie Mellon University) and Kijung Shin (Korea Advanced Institute of Science and Technology)
Abstract

Group interactions are prevalent in various complex systems (e.g., collaborations of researchers and group discussions on online Q&A sites), and they are commonly modeled as hypergraphs. Hyperedges, which compose a hypergraph, are non-empty subsets of any number of nodes, and thus each hyperedge naturally represents a group interaction among entities. The higher-order nature of hypergraphs brings about unique structural properties that have not been considered in ordinary pairwise graphs. In this tutorial, we offer a comprehensive overview of a new research topic called hypergraph mining. Specifically, we first present recently revealed structural properties of real-world hypergraphs, including (a) static and dynamic patterns, (b) global and local patterns, and (c) connectivity and overlapping patterns. Together with the patterns, we describe advanced data mining tools used for their discovery. Lastly, we introduce simple yet realistic hypergraph generative models that provide an explanation of the structural properties.
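
As a rough illustration of the hypergraph definition above, the following Python sketch stores each hyperedge as a set of nodes and computes three simple statistics: node degrees, hyperedge sizes, and pairwise hyperedge overlaps. The toy co-authorship data and all function names are hypothetical and are not taken from the tutorial materials.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each hyperedge is the author set of one paper.
hyperedges = [
    frozenset({"ana", "bo", "chen"}),
    frozenset({"bo", "chen"}),
    frozenset({"ana", "chen", "dee", "eli"}),
]


def node_degrees(edges):
    """Count how many hyperedges each node belongs to."""
    return Counter(node for edge in edges for node in edge)


def edge_sizes(edges):
    """Return the number of nodes in each hyperedge."""
    return [len(edge) for edge in edges]


def pairwise_overlaps(edges):
    """Return the intersection size for every pair of hyperedges, keyed by index pair."""
    return {
        (i, j): len(edges[i] & edges[j])
        for i, j in combinations(range(len(edges)), 2)
    }


degrees = node_degrees(hyperedges)
sizes = edge_sizes(hyperedges)
overlaps = pairwise_overlaps(hyperedges)
```

Unlike an ordinary graph, where every edge joins exactly two nodes, the hyperedges here have sizes 3, 2, and 4, and two hyperedges can share more than one node.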

5
Tutorial on Deep Learning Interpretation: A Data Perspective
Fang Jin
Email
Half-Day
Organizers
Zhou Yang (George Washington University), Ninghao Liu (University of Georgia), Xia "Ben" Hu (Rice University), and Fang Jin (George Washington University)
Abstract

Deep learning models have achieved exceptional predictive performance in a wide variety of tasks, ranging from computer vision and natural language processing to graph mining. Many businesses and organizations across diverse domains are now building large-scale applications based on deep learning. However, there are growing concerns regarding the fairness, security, and trustworthiness of these models, largely due to the opaque nature of their decision processes. Recently, there has been increasing interest in explainable deep learning, which aims to reduce the opacity of a model by explaining its behavior, its predictions, or both, thus building trust between humans and complex deep learning models. A number of explanation methods have been proposed in recent years that address the low explainability and opaqueness of models. In this tutorial, we introduce recent explanation methods from a data perspective, targeting models that process image data, text data, and graph data, respectively. We will compare their strengths and limitations, and offer real-world applications.

6
Learning and Mining with Noisy Labels
Masashi Sugiyama
Email
Half-Day
Organizers
Masashi Sugiyama (RIKEN / The University of Tokyo), Tongliang Liu (The University of Sydney), Bo Han (Hong Kong Baptist University), Yang Liu (University of California, Santa Cruz) and Gang Niu (RIKEN)
Abstract

“Knowledge should not be accessible only to those who can pay,” said Robert May, chair of UC’s faculty Academic Senate. Similarly, machine learning should not be accessible only to those who can pay. Thus, machine learning should benefit the whole world, especially developing countries in Africa and Asia. As datasets grow bigger, it is laborious and expensive to obtain clean supervision, especially for developing countries. As a result, the volume of noisy supervision becomes enormous, e.g., web-scale image and speech data with noisy labels. However, standard machine learning assumes that the supervised information is fully clean and intact. Therefore, noisy data harms the performance of most standard learning algorithms, and sometimes even makes existing algorithms break down. A number of theories and approaches have been proposed to deal with noisy data. As far as we know, learning and mining with noisy labels spans two important ages in the machine learning, data mining, and knowledge management communities: statistical learning (i.e., shallow learning) and deep learning. In the age of statistical learning, learning and mining with noisy labels focused on designing noise-tolerant losses or unbiased risk estimators. In the age of deep learning, by contrast, there are more options to combat noisy labels, such as designing biased risk estimators or leveraging the memorization effects of deep networks. In this tutorial, we summarize the foundations and go through the most recent noisy-label-tolerant techniques. By participating in the tutorial, the audience will gain a broad knowledge of learning and mining with noisy labels from the viewpoint of statistical learning theory, deep learning, detailed analysis of typical algorithms and frameworks, and their real-world data mining applications.
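
To make the idea of an unbiased risk estimator concrete, here is a minimal Python sketch of one well-known construction for binary labels under class-conditional noise. It assumes labels in {+1, -1} and known flip rates `rho_pos` (probability that a true +1 is observed as -1) and `rho_neg` (probability that a true -1 is observed as +1). The function names and the choice of logistic loss are illustrative assumptions; this is not necessarily the formulation presented in the tutorial.

```python
import math


def logistic_loss(score, label):
    """Logistic loss for a real-valued score and a label in {+1, -1}."""
    return math.log1p(math.exp(-label * score))


def unbiased_loss(score, noisy_label, rho_pos, rho_neg):
    """Corrected loss whose expectation over the label noise equals the clean loss.

    Assumes rho_pos + rho_neg < 1 and that both flip rates are known.
    """
    # Flip rate of the observed class and of the opposite class.
    rho_same = rho_pos if noisy_label == 1 else rho_neg
    rho_other = rho_neg if noisy_label == 1 else rho_pos
    numerator = (
        (1.0 - rho_other) * logistic_loss(score, noisy_label)
        - rho_same * logistic_loss(score, -noisy_label)
    )
    return numerator / (1.0 - rho_pos - rho_neg)


def expected_noisy_loss(score, true_label, rho_pos, rho_neg):
    """Expectation of the corrected loss over the noisy label, given the true label."""
    flip = rho_pos if true_label == 1 else rho_neg
    kept = unbiased_loss(score, true_label, rho_pos, rho_neg)
    flipped = unbiased_loss(score, -true_label, rho_pos, rho_neg)
    return (1.0 - flip) * kept + flip * flipped
```

Averaged over the label noise, the corrected loss matches the loss on the clean label, which is what makes the resulting empirical risk unbiased. The corrected loss can be negative for individual examples, one reason later work also studies biased estimators.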

7
Self-Supervised Learning for Recommendation
Chao Huang
Email
Half-Day
Organizers
Chao Huang (University of Hong Kong), Lianghao Xia (University of Hong Kong), Xiang Wang (University of Science and Technology of China), Xiangnan He (University of Science and Technology of China) and Dawei Yin (Baidu)
Abstract

Recommender systems have become key components of a wide spectrum of web applications (e.g., e-commerce sites, video sharing platforms, and lifestyle applications), alleviating information overload and suggesting items to users. However, most existing recommendation models follow a supervised learning paradigm, which notably limits their representation ability given the ubiquitous sparse and noisy data in practical applications. Recently, self-supervised learning (SSL) has become a promising learning paradigm for distilling informative knowledge from unlabeled data without heavy reliance on sufficient supervision signals. Inspired by the effectiveness of self-supervised learning, recent efforts bring the benefits of SSL to various recommendation representation learning scenarios through augmented auxiliary learning tasks. In this tutorial, we aim to provide a systematic review of existing self-supervised learning frameworks and analyze the corresponding challenges for various recommendation scenarios, such as the general collaborative filtering paradigm, social recommendation, sequential recommendation, and multi-behavior recommendation. We then raise discussions and future directions in this area. By introducing this emerging and promising topic, we expect the audience to gain a deep understanding of this domain. We also seek to promote more ideas and discussions, which will facilitate the development of self-supervised learning techniques for recommendation.