CIKM Industry Day Monday, October 17

October 17, 2022 9:00 AM
Conference Keynote - Ensemble Learning Methods for Dirty Data
Ling Liu
Georgia Tech.

A neural network ensemble is a collaborative learning paradigm that uses multiple neural networks to solve a complex learning problem. Constructing predictive models with high generalization performance is an important yet challenging goal for robust intelligence systems in the presence of dirty data. Given a target learning task, popular approaches have been dedicated to finding the top-performing model. However, it is generally difficult to estimate the best model when the available data is finite, possibly dirty, and insufficient for the problem. In this keynote, I will give an overview of a diversity-centric ensemble learning framework developed at Georgia Tech, including methodologies and algorithms for measuring, enforcing, and combining multiple neural networks to improve the generalization performance of the overall system and to maximize ensemble utility and resilience to dirty data.
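As a rough illustration of what measuring diversity and combining ensemble members can mean, the sketch below uses two generic textbook ingredients: plurality voting and average pairwise disagreement. It is not the Georgia Tech framework described in the keynote; the function names and the toy predictions are assumptions made only for this example.

```python
from collections import Counter
from itertools import combinations


def majority_vote(predictions):
    """Combine per-model label lists by plurality vote for each sample."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]


def pairwise_disagreement(predictions):
    """Average fraction of samples on which two member models disagree."""
    pairs = list(combinations(predictions, 2))
    rates = [sum(a != b for a, b in zip(p, q)) / len(p) for p, q in pairs]
    return sum(rates) / len(rates)


# Hypothetical labels predicted by three member models on four samples.
member_predictions = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]

ensemble = majority_vote(member_predictions)
diversity = pairwise_disagreement(member_predictions)
```

Here each member makes a different mistake, so the vote can recover a consistent answer; a higher disagreement score indicates members that err in different places.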

Peachtree Ball Room - 9:00 AM to 10:00 AM

October 17, 2022 10:30 AM
On the Challenges of Podcast Search at Spotify
Mi Tian, Claudia Hauff and Praveen Chandar

October 17, 2022 10:50 AM
Fifty Shades of Pink: Understanding Color in e-commerce using Knowledge Graphs
Lizzie Liang, Sneha Kamath, Petar Ristoski, Qunzhi Zhou and Zhe Wu
eBay Inc.

The color of the products is one of the most prevalent aspects in many e-commerce domains, and it is one of the decisive purchasing factors. Besides having thousands of color variations and shades, many brands continuously develop proprietary colors and color names to attract more customers. This often leads to color ambiguity (textual and visual), and vocabulary mismatch between buyers and sellers. Therefore, it is crucial for any e-commerce search engine to correctly identify the buyer's color intent and match it to the corresponding product listings. To address this challenge, in this work, we introduce a color query expansion approach using color Knowledge Graphs. We use Knowledge Graphs to unambiguously identify all the colors based on their properties, and the relationships to other colors, which allows us to perform semantic query expansion. Similar expansion concepts could be applied to domains outside of color.
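A minimal sketch of the idea of semantic query expansion over a color graph follows. It is not the authors' system: the graph, its single "shade of" relation, and the function name are all illustrative assumptions, and the color entries are toy data rather than eBay data.

```python
# Toy color graph: maps a shade to its parent color.
# Every entry is an illustrative assumption, not eBay data.
SHADE_OF = {
    "blush": "pink",
    "rose": "pink",
    "fuchsia": "pink",
    "navy": "blue",
}


def expand_color_query(color):
    """Return the color followed by its parent and sibling shades."""
    parent = SHADE_OF.get(color, color)
    shades = sorted(c for c, p in SHADE_OF.items() if p == parent)
    expanded = [color]
    for candidate in [parent] + shades:
        if candidate not in expanded:
            expanded.append(candidate)
    return expanded


terms = expand_color_query("blush")
```

A query for "blush" is expanded to its parent color and sibling shades, while a color missing from the graph is returned unchanged. A production graph would presumably carry richer properties and relations than this single edge type.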

Augusta H - 10:50 - 11:10 AM

October 17, 2022 11:10 AM
Shoe Size Resolution in Search Queries and Product Listings using Knowledge Graphs
Petar Ristoski, Aritra Mandal, Simon Becker, Anu Mandalam, Ethan Hart, Sanjika Hewavitharana, Zhe Wu and Qunzhi Zhou
eBay Inc.

Fashion is one of the most profitable domains for most e-commerce shops, and shoes are one of the top-selling categories within it. When shopping for shoes, one of the most important aspects for buyers is the shoe size. Shoe size charts differ across brands, geographical regions, genders and age groups. When a buyer or a seller omits some of these details, the query intent can fail to match the inventory, leading to fewer or wrong search results. Furthermore, buying the wrong shoe size is one of the top reasons for product returns, which causes shipping delays and loss in revenue. To address this issue, we propose an approach for shoe size resolution and normalization in search queries and product listings using Knowledge Graphs.

Augusta H - 11:10 - 11:30 AM

October 17, 2022 11:30 AM
Utilizing Contrastive Learning To Address Long Tail Issue in Product Categorization
Lei Chen and Tianqi Wang

Neural network (NN) models trained in a supervised way have become dominant. Although high performance can be achieved when training data is ample, performance on labels with sparse training instances can be poor. This performance degradation caused by imbalanced data is known as the long-tail (LT) issue, and it affects many NN models used in practice. In this talk, we will first review machine learning approaches that address the long-tail issue. Next, we will report on our effort to apply one recent LT-addressing method to the item categorization (IC) task, which aims to classify product description texts into leaf nodes in a category taxonomy tree. In particular, we adopted a new method that decouples the entire classification task into (a) learning representations using the K-positive contrastive loss (KCL) and (b) training a classifier on a balanced data set. Using SimCSE as our self-learning backbone, we demonstrated that the proposed method works on the IC text classification task. In addition, we spotted a shortcoming in KCL: false negative (FN) instances may harm the representation learning step. After eliminating FN instances, IC performance (measured by macro-F1) improved further.

Augusta H - 11:30 - 11:50 AM

October 17, 2022 11:50 AM
Geographical Address Models in the Indian e-Commerce
Ravindra Babu Tallamraju
Sahaj AI

Unambiguous customer addresses are important for e-commerce companies to deliver shipments accurately and on time. In many developing countries, a prescribed address structure is not usually followed in practice. Customer addresses are observed to contain additional text such as instructions to the delivery team, spelling errors, jumbled characters, and inadvertent spacing and merging. Furthermore, in many countries not every address has associated geolocation information. Yet address understanding, address classification, and clustering of similar but noisy addresses are critical: they help in last-mile delivery solutions, help reduce buyer fraud, and help in understanding returns. Given the noisy nature of customer addresses and the non-availability of geolocation for all addresses, the above problems are effectively solved with the help of NLP, and the solutions are deployed. The proposed talk traces the challenges of Indian addresses in the context of e-commerce, solution approaches, and their extensions during the last 8 years. The talk is based on the author’s own experience, his publications, and developments in this topic across the industry over these years.

Augusta H - 11:50 AM - 12:10 PM

October 17, 2022 12:10 PM
From Product Searches to Conversational Agents for E-Commerce
Giuseppe Di Fabbrizio

Although consumers’ demand for online shopping has increased substantially in the last few years, e-commerce companies are still far from providing a high-quality user experience that can compete with in-store experiences. On the one hand, matching search queries with highly relevant products for discovery and browsing is still a challenge for existing search technologies. Available e-commerce solutions hardly provide tools to optimize product search relevance and fail to integrate user behavior signals into the search optimization pipeline. On the other hand, access to the rich and complex information concealed in an e-commerce catalog through a search bar has not evolved far since its initial adoption. In this talk, we illustrate how the VUI conversational AI platform has been successfully adopted both to improve the quality of the user’s experience with highly relevant search and discovery results, and to expand the traditional search bar with conversational agent technology, enriching the user’s experience at each stage of the e-commerce product life cycle. We review in depth some of the key deep learning models that are part of the query understanding component and discuss the overall conversation architecture as it integrates with an existing e-commerce catalog. We include real-life demonstrations derived from use cases extracted from deployed systems.

Augusta H - 12:10 - 12:30 PM

October 17, 2022 12:30 PM
Leveraging Automated Search Relevance Evaluation to Improve System Deployment: A Case Study in Healthcare
Yizhao Ni, Ferosh Jacob, Priya Gopi Achuthan, Hui Wu and Faizan Javed
Kaiser Permanente

Traditional software testing techniques that rely on limited use cases and consistent behavior are neither comprehensive nor specific enough to capture complex user search behaviors. To support system deployment, we utilize information retrieval (IR) technologies to monitor search performance, detect development bugs, identify areas of improvement, and suggest actionable items. In this case study, we share industrial experience in building an IR evaluation pipeline and using it to inform deployment and improve system development. The work emphasizes domain-specific challenges, best practices and lessons learned during system deployment in a healthcare setting. It features the ability of IR techniques to strengthen collaboration between data scientists, software engineers and managers in making data-driven decisions.

Augusta H - 12:30 - 12:45 PM

October 17, 2022 2:00 PM
Industry Day Keynote - Customer Obsessed Science
Vanessa Murdock
Amazon Alexa Shopping

October 17, 2022 3:30 PM
Invited Talk - The journey towards cross lingual product search and e-Commerce: the case of Spanglish
Leonardo Lezcano
Walmart Labs

Online stores in the US offer a unique scenario for Cross-Lingual Information Retrieval due, for example, to the frequent mix of Spanish and English in queries. While physical stores have traditionally allowed a visual product search by walking the aisles, the hyper e-Commerce adoption driven by the COVID pandemic has turned the lack of familiarity with specific English terms into an obstacle to a smooth online experience. Machine Translation (MT) can lift query relevance in these scenarios, but context scarcity, language ambiguity and polysemy-derived problems, among others, make generic MT an impractical solution. In this talk, we will cover the multilingual challenges that e-Commerce websites deal with in their specific domains. Beyond translation accuracy, and given that translation may not be the final goal but the means to a frictionless buying experience, the problem of content and query translatability will also be addressed. It will be a good opportunity to discuss the pros and cons of the different approaches that tackle these issues.

Augusta H - 3:30 - 4:00 PM

October 17, 2022 4:00 PM
Executable Knowledge Graph for Transparent Machine Learning in Welding Monitoring at Bosch
Zhuoxun Zheng, Baifan Zhou, Dongzhuoran Zhou, Ahmet Soylu and Evgeny Kharlamov

With the development of Industry 4.0 technology, modern industries such as Bosch's welding monitoring have witnessed the rapid spread of machine learning (ML) based data analytics applications, which in the case of welding monitoring has led to more efficient and accurate quality monitoring. However, industrial ML is affected by the low transparency of ML towards non-ML experts. The lack of understanding of ML methods by domain experts hampers the application of ML methods in industry and the reuse of developed ML pipelines, as ML methods are often developed in an ad hoc manner for specific problems. To address these challenges, we propose the concept and a system of executable Knowledge Graphs (KGs), which formally encode ML knowledge and solutions in KGs. These serve as a common language between ML experts and non-ML experts, thus facilitating their communication and increasing the transparency of ML methods. We evaluated our system extensively with an industrial use case at Bosch, including a user study and workshops. The evaluation demonstrates that the system offers a user-friendly way for even non-ML experts to discuss, customise, and reuse ML methods.

Augusta H - 4:00 - 4:15 PM

October 17, 2022 4:15 PM
Synerise Monad - Real-Time Multimodal Behavioral Modeling
Jacek Dąbrowski and Barbara Rychalska

The growth of time-sensitive heterogeneous data in industry-grade data lakes has recently reached unprecedented momentum. In response to this, we propose Synerise Monad - a prototype of a real-time behavioral modeling platform for event-based data streams. It automates representation learning and model training on massive data sources with arbitrary data structures. With Monad we showcase how to automatically process various data modalities, such as temporal, graph, categorical, decimal, and textual data types, in a time-sensitive way that allows for real-time feature creation and predictions. Monad's distributed and scalable architecture, coupled with efficient award-winning algorithms developed at Synerise - Cleora and EMDE - allows processing real-life datasets composed of billions of events in record time. The Monad ecosystem showcases a viable path towards real-time event-based AutoML.

Augusta H - 4:15 - 4:35 PM

October 17, 2022 4:35 PM
Building Next Best Action Engines for B2C and B2B Use Cases
Ilya Katsov
Grid Dynamics

Traditional machine learning methods used in marketing and digital commerce applications, including propensity scoring and many recommendation algorithms, are usually focused on improving short-term outcomes such as click-through rates. In many environments, however, long-term customer engagement can be more important than immediate outcomes. In this paper, we describe several real-world case studies on building personalization engines that address this problem using reinforcement learning (RL) methods. We also discuss the design patterns used to create such solutions.

Augusta H - 4:35 - 4:55 PM

October 17, 2022 4:55 PM
Building Natural Language Processing Applications with EasyNLP
Chengyu Wang, Minghui Qiu and Jun Huang

The successful application of Pre-Trained Models (PTMs) has revolutionized the development of Natural Language Processing (NLP) through large-scale self-supervised pre-training. However, it is not easy for industrial practitioners to obtain high-performing models in domain-specific applications and deploy them online with strict QPS (Queries Per Second) requirements. To solve these issues, the EasyNLP toolkit is designed for building PTM-based NLP applications with ease; it supports a comprehensive suite of NLP algorithms and is suitable for meeting the inference requirements in industry. It features knowledge-enhanced pre-training that captures rich domain knowledge to better support domain-specific applications. In addition, knowledge distillation and prompt-based few-shot learning functionalities are provided to improve the performance of large-scale PTMs when little training data is available, and to distill models into smaller ones that are suitable for online deployment. EasyNLP provides a unified framework for model training, inference and deployment for real-world applications, using simple high-level APIs or command-line tools. Currently, EasyNLP has powered over ten business units within Alibaba Group and is seamlessly integrated into the Platform of AI (PAI) products on Alibaba Cloud. EasyNLP is also beneficial for academia, as it integrates state-of-the-art methods and models to make it easy for researchers to benchmark and develop their own algorithms. We have released EasyNLP to the public on GitHub (https: //

Augusta H - 4:55 - 5:15 PM

October 17, 2022 5:15 PM
Simulating complex problems inside a database
Nikolaos Vasiloglou
Relational AI

Python has been a very successful language for coding AI algorithms. SQL is the de facto language for relational modeling of business processes. We are accustomed to using a relational database as the source for generating features that are consumed by a machine learning predictive model in Python. Is this the optimal way to use data for making predictions and making decisions over business problems? While predictive models are quite powerful, they often fail to model complicated sequential processes that have complicated dependencies. We typically model these dependencies as Knowledge Graphs. We believe that when a business problem presents a complicated Knowledge Graph, the agent-based simulation paradigm is ideal for revealing useful insights for making business decisions that go beyond inference. In modern machine learning this framework is often called Probabilistic Programming. The reason why we don’t see simulation widely adopted is the lack of appropriate data-centric languages. In this talk we will discuss how we used a novel language ideal for describing Knowledge Graphs in order to build simulators for business problems. We will go through two simple cases, modeling the flooding problem of a city and modeling the airline delay flight pattern. In the end we will explore how a database that enables simulation can unlock the potential of building reinforcement learning for optimizing and solving extremely hard problems.

Augusta H - 5:15 - 5:35 PM

October 17, 2022 5:35 PM
Intent Disambiguation for Task-oriented Dialogue Systems
Andrea Alfieri, Ralf Wolter and Seyyed Hadi Hashemi

Task-Oriented Dialogue Systems (TODS) have been widely deployed as domain-specific virtual assistants at contact centres to route customers' calls or to satisfy the customer's information needs in a conversational interaction. TODS employ natural language understanding components in order to map user commands to a set of pre-defined intents. However, contact centre users often fail to formulate their complex information needs in a single utterance, which leads to ambiguous user commands. This can negatively impact intent classification, and consequently customer satisfaction. To avoid feeding ambiguous user commands to the intent classifier of virtual assistants and to help users in formulating their commands, we have implemented a solution that (1) identifies when a user command is ambiguous and the virtual assistant should ask a clarification question, and (2) disambiguates the user command and provides the top-N most likely intents in the form of a clarification question. Our experimental results show that our proposed solution has a statistically significant positive impact on disambiguating users' intents for a virtual assistant of a contact centre.
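The two steps above can be pictured with a small sketch. The abstract does not say how ambiguity is detected, so the rule used here (a confidence margin between the two highest-scoring intents) is an assumption, as are the function name, the threshold value, and the example intents.

```python
def disambiguate(intent_probs, margin=0.2, top_n=2):
    """Route to the best intent, or ask a clarification question.

    intent_probs: dict mapping intent name to classifier probability
                  (must contain at least two intents).
    margin:       hypothetical threshold; a smaller gap between the two
                  best intents is treated as an ambiguous command.
    """
    ranked = sorted(intent_probs.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], ranked[1]
    if p1 - p2 >= margin:
        # Step (1): the command is considered unambiguous.
        return {"action": "route", "intent": best}
    # Step (2): offer the top-N intents as a clarification question.
    options = [name for name, _ in ranked[:top_n]]
    question = "Did you mean: " + " or ".join(options) + "?"
    return {"action": "clarify", "question": question}


ambiguous = disambiguate({"billing": 0.45, "cancel": 0.40, "other": 0.15})
clear = disambiguate({"billing": 0.90, "cancel": 0.05, "other": 0.05})
```

In the first call the two leading intents are close, so a clarification question is produced; in the second the leading intent wins by a wide margin and the command is routed directly.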

Augusta H - 5:35 - 5:55 PM