How to Succeed with Machine Learning

Learn the basic do's and don'ts for getting started with a machine learning solution.

Q & A

About Dr. Samuel Ackerman

Dr. Samuel Ackerman is a data science researcher at IBM Research, Haifa. He currently focuses on statistical and ML analysis for improving hardware verification systems, and on providing general statistical consulting within IBM. Sam received his Ph.D. in statistics from Temple University in Philadelphia, PA (2018), where his thesis research applied Bayesian particle filtering algorithms to the analysis of animal movement patterns. Before joining IBM, he worked as a research assistant at the Federal Reserve Board of Governors (Washington, DC) and as an instructor in business math at Temple.

Sam welcomes any feedback on the topics discussed below, as well as any suggestions for questions or topics people would like to see discussed, especially if they could be beneficial to the general IBM community. Please use the email link to the right.

Question 01: I have a sample of values of a random variable. I want to conduct statistical inference on the variable based on the sample, for instance to construct a confidence interval for the value or estimate its distribution, while making as few statistical assumptions as possible. What can I do?

Question 02: What are good ways to represent text documents (long or short) as features for model-building?

Question 03: I have some data I am monitoring over time. It is possible that there will be some change to the data, and I want to detect this change. I also want to have statistical guarantees about the accuracy of my decision, but want to make as few assumptions about the data as possible. What is the correct way to formulate this problem?

Question 04: I have two samples, each representing a random draw from a distribution, and I want to calculate some measure of the distance between them. What are some appropriate metrics?

Question 05: I have multiple independent distributions (i.e., random variables), or multiple independent samples (i.e., draws from random variables). I want to estimate the distribution of some combination of these. How do I do this?

Question 06: I want to construct a confidence interval (CI) for a statistic. Many CI formulas are calculated in a form similar to "$\text{value} \pm 1.96 \times \text{standard deviation}$." When is it appropriate to use these, and how do I figure out which one to use?

Question 07: I have a fixed set of (named) objects, or a fixed table of class variable levels with fixed probabilities. I want to draw a random sample with replacement from the objects or from the levels of the class variable, such that the resulting relative frequencies match the fixed set or probabilities as closely as possible. What is a good way to do this?

Question 08: I have a sample of data points that represent some population of interest. I would like a flexible, nonparametric method that uses the observed distribution to determine whether a given data point is an outlier (anomaly) relative to this distribution. How can I do this?

Question 09: Given two data samples, I would like to conduct a test to identify regions of the domain where the two sample distributions differ significantly. How can I do this?

Question 10: I have two (or more) models whose performance I want to compare, either on the same or different datasets. I may have conducted cross-validation on each, and thus have a set of results (e.g., accuracy scores) on each, and there may be overlaps in the ranges of these values. How do I statistically decide which model I should prefer?

Question 11: I have two datasets, D_1 and D_2, of multivariate samples of the same set of features. I would like to 1) test whether D_2 appears to have drifted in distribution relative to D_1, and 2) score each feature's contribution to the drift. That is, I want to know which feature(s) in D_2 are most anomalous relative to their observed distribution in D_1.


Samuel Ackerman, Analytics & Quality Technologies, IBM Research - Haifa

