Introducing Semantic Data Science (SDS), an automation system that augments data with domain knowledge for building AI solutions, a task that typically requires the painstaking effort of skilled data scientists.
Many of the steps required to build AI models can be automated to speed up development. But steps that involve understanding semantic concepts, such as feature engineering (augmenting the given data to improve performance), typically require hands-on attention from data scientists.
Data scientists often use their domain knowledge to identify additional features that might enrich their models, because machines can’t identify semantic meaning in data or link it to existing knowledge as humans do. State-of-the-art systems for automated feature engineering rely on trial and error, generating new features by transforming the given data. This approach has several drawbacks. Some data transformations are difficult to find by chance or with generalized exploration techniques. Others, such as adding two unrelated variables together (height and weight, for example), might not make sense in the real world. These methods also can’t draw on external data to improve a model, so they cannot capitalize on the vast amount of information available in knowledge bases, wikis and code repositories.
We’re bringing the power of automation to semantics-oriented tasks like feature engineering with Semantic Data Science (SDS), a new automated system that we demonstrated at the 2022 conference of the Association for the Advancement of Artificial Intelligence (AAAI) [1]. SDS discovers concepts relevant to a dataset and links them with external knowledge and code to identify new, rich features. Data scientists can explore the context behind the new features before incorporating the ones that will improve model accuracy. By quickly surfacing potential new features with human-interpretable explanations, SDS can help data scientists build better AI models more efficiently.
How semantics are used
Our SDS system has two components. The first maps concepts to the columns of the given data, enabling automated discovery of connections among existing features, and between those features and external data. The second mines large code repositories available online for data transformations that use the same columns as the given data. SDS then presents users with suggested new features to augment their data and boost the performance of the AI models they are building.
SDS uses two column-to-concept mappers. One finds the concepts most likely to match column values among different sources of structured data, such as Wikipedia tables or Wikidata and DBpedia knowledge graphs. A second mapper links column names to concepts in knowledge graphs in three ways: First, a notation fluctuation solver identifies different phrases in column names (e.g., “speed” and “velocity”) that map to the same concept by using phrase similarity metrics. Second, an alias solver uses alias information, either explicitly defined in Wikidata or inferred from the anchor texts of links in Wikipedia, to understand different word-level references to the same concept. Third, a context understanding component distinguishes between concepts with the same label (e.g., “Apple” the company versus “apple” the fruit) by assessing the sentence similarity of the column descriptions to entries in Wikidata or DBpedia. SDS automatically annotates the columns with the concepts it identifies, which the user can refine.
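To make the column-name mapper more concrete, here is a minimal sketch of how an alias lookup and a similarity score might be combined. It is not the SDS implementation: the `CONCEPT_ALIASES` table, the `map_column` function and the 0.8 threshold are all hypothetical, and the character-level similarity from Python's `difflib` stands in for the phrase similarity metrics mentioned above.

```python
from difflib import SequenceMatcher

# Hypothetical mini knowledge base: concept label -> known aliases.
# SDS draws aliases from Wikidata and Wikipedia anchor texts; these are made up.
CONCEPT_ALIASES = {
    "speed": ["velocity", "pace"],
    "body mass": ["weight", "mass"],
    "height": ["stature", "tallness"],
}


def normalize(name):
    """Lowercase a column name and turn separators into single spaces."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a phrase metric."""
    return SequenceMatcher(None, a, b).ratio()


def map_column(name, threshold=0.8):
    """Return (concept, score) for the best-matching concept, or None."""
    phrase = normalize(name)
    best = None
    for concept, aliases in CONCEPT_ALIASES.items():
        # Compare against the concept label itself and every alias.
        for label in [concept, *aliases]:
            score = similarity(phrase, label)
            if best is None or score > best[1]:
                best = (concept, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


for column in ["Velocity", "body_mass", "zzz"]:
    print(column, "->", map_column(column))
```

Character similarity alone would never connect "speed" and "velocity", which is why the sketch relies on the alias table for that case. Disambiguating same-label concepts from column descriptions, the third step described above, is not covered here.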
Feature engineering often involves transformation of the given data to generate new features. SDS uses three tools that we developed to mine existing code for transformations that might be relevant to the given columns and concepts. One performs interprocedural static analysis to identify similar columns from millions of Python programs. A second tool automatically instruments code to keep track of executions, so we can extract operations with relevant column names. The third tool mines text cells in Jupyter notebooks for formulas and uses question-answering techniques to understand the terms in order to identify operations that could be pertinent to the dataset. SDS presents the features generated via these transformations to users, along with the context surrounding the code. Users can inspect the concepts and code snippets identified by SDS, choose those they wish to include, and apply the appropriate data transformations to expand the feature space before building their models.
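As a rough illustration of mining code for transformations, the sketch below uses Python's standard `ast` module to find assignments that derive a new column from columns present in the given data. It is far simpler than the interprocedural analysis described above: it only inspects top-level statements in a single snippet, and the `SNIPPET` text and the `mine_transformations` helper are hypothetical.

```python
import ast

# Hypothetical code that a miner might encounter in a public repository.
SNIPPET = '''
df["deaths_percent"] = df["Deaths"] / df["Confirmed"] * 100
df["active"] = df["Confirmed"] - df["Recovered"] - df["Deaths"]
total = df["Recovered"].sum()
'''


def string_subscripts(node):
    """Collect string keys used in subscripts under node, e.g. df["Deaths"].

    Assumes Python 3.9+, where Subscript.slice holds the key expression directly.
    """
    keys = []
    for child in ast.walk(node):
        if isinstance(child, ast.Subscript):
            key = child.slice
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                keys.append(key.value)
    return keys


def mine_transformations(source, known_columns):
    """Map each newly assigned column to the sorted columns it is computed from.

    Only assignments that read at least one known column are kept.
    """
    found = {}
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
            continue
        targets = string_subscripts(stmt.targets[0])
        inputs = set(string_subscripts(stmt.value))
        if len(targets) == 1 and inputs & set(known_columns):
            found[targets[0]] = sorted(inputs)
    return found


columns = {"Confirmed", "Deaths", "Recovered"}
for new_column, inputs in mine_transformations(SNIPPET, columns).items():
    print(new_column, "<-", inputs)
```

The last line of the snippet is skipped because it assigns to a plain variable rather than to a column. A real miner would also need to follow data flow across functions and files.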
Using SDS with real-world data
We used SDS for feature engineering on a public COVID-19 dataset with the following columns: Date, Longitude, Latitude, Province/State, Country/Region, Recovered, Confirmed, and Deaths. SDS extracted the most salient concepts related to the data, such as “COVID,” which is not explicitly mentioned in the data but was inferred by the mapping component. SDS then found new features related to those concepts. Some can be computed directly: for example, SDS suggested computing “confirmed_percent” and “deaths_percent” from Country/Region and either Confirmed or Deaths, respectively. Others cannot necessarily be computed from the given data; for instance, SDS identified an epidemiology model that describes the spread of a disease as a potentially related feature, although it doesn’t suggest how to compute or incorporate this information. SDS lets users explore the sources behind the concepts and features it identifies, providing far more insight and assistance than current tools do.
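The exact definition of “confirmed_percent” and “deaths_percent” is not spelled out here, so the following sketch shows just one plausible reading: each country's share of the overall total. The rows are toy values, not the real dataset, and `percent_by_group` is a hypothetical helper.

```python
from collections import defaultdict

# Toy rows that reuse the dataset's column names; the values are made up.
rows = [
    {"Country/Region": "A", "Confirmed": 50, "Deaths": 5},
    {"Country/Region": "A", "Confirmed": 30, "Deaths": 1},
    {"Country/Region": "B", "Confirmed": 20, "Deaths": 4},
]


def percent_by_group(rows, group_col, value_col):
    """Return each group's share (in percent) of the total of value_col."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    grand_total = sum(totals.values())
    if grand_total == 0:
        # Avoid dividing by zero when the column sums to nothing.
        return {group: 0.0 for group in totals}
    return {group: 100.0 * total / grand_total for group, total in totals.items()}


confirmed_percent = percent_by_group(rows, "Country/Region", "Confirmed")
deaths_percent = percent_by_group(rows, "Country/Region", "Deaths")
print(confirmed_percent)
print(deaths_percent)
```

Other readings are equally possible, such as deaths as a percentage of confirmed cases within each country. Showing users the source behind each suggested feature lets them decide which definition was intended.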
Understanding domain knowledge remains mostly elusive in today’s AI, but SDS demonstrates an initial capability for AI to use semantics. This adds a new and powerful dimension to automated machine learning, expanding the ways automation can assist data scientists and further accelerating the development and deployment of AI models.
In addition, SDS generates human-readable explanations of the new features it identifies, whereas considerable effort is generally required to reverse-engineer the derivation of new features by neural networks. With SDS, users can easily interpret the features, determine whether they are appropriate, and address any errors. While SDS focuses on feature engineering, other aspects of data science also depend heavily on semantics. For example, model explainability is based in part on interpreting concepts in the data and the meaning of the operations performed on them. Automated semantic concept recognition would therefore be central to automating tasks associated with improving explainability.
Conventional AI development relies on data scientists or domain experts to recognize key concepts in a dataset and connect them with external data or real-world knowledge. By bringing automation to semantic tasks, SDS takes a step toward building an infrastructure for semantics-oriented data science based on mapping concepts and mining existing knowledge and code.
IBM Research Semantic Data Science packages some of the state-of-the-art techniques for column-to-concept mapping, code analysis and knowledge extraction, along with a unified framework for accessing them. Get started with the Semantic Data Science API.
[1] Srinivas, K., Tateishi, T., Weidele, D. K. I., Khurana, U., Samulowitz, H., Takahashi, T., Wang, D., and Amini, L. (2022, February 22–March 1). Semantic Feature Discovery with Code Mining and Semantic Type Detection. 36th AAAI Conference on Artificial Intelligence, Vancouver, Canada.