Publication
ICDMW 2010
Conference paper
ALPOS: A machine learning approach for analyzing microblogging data
Abstract
As the Internet has grown, the increasing volume of information posted on microblogging sites like Twitter calls for efficient information filtering. In conventional text classification problems, it is assumed that the feature vectors extracted from the available documents are sufficient to learn good classifiers. However, this conventional approach is unlikely to work for Twitter because of the limited number of characters in each tweet. At a higher level, each tweet can be viewed as an abbreviated abstraction of a long document, of which we have only a partial observation. To solve the problem caused by the partial observations, we introduce a novel domain adaptation/transfer learning approach called Assisted Learning for Partial Observation (ALPOS). The basic idea is to use a large number of multi-labeled examples (source domain) to improve the learning on the partial observations (target domain). In particular, we learn a hidden, higher-level abstraction space that is meaningful for the multi-labeled examples in the source domain. This is done by simultaneously minimizing the document reconstruction error and the error of a classification model learned in the hidden space using known labels from the source domain. The partial observations in the target domain are then mapped to the same hidden space for recovery and classification. We compare the performance of this method with existing approaches on synthetic data and the well-known Reuters-21578 dataset. We also present experimental results on Twitter classification. © 2010 IEEE.
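The joint objective described in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the abstract does not specify the model, so the linear encoder, decoder, and classifier, the squared-error losses, the weight lam, plain gradient descent, and every function and variable name below are assumptions chosen for illustration only.

```python
import numpy as np

def fit_hidden_space(X, Y, hidden_dim=3, lam=1.0, lr=0.05, steps=1000, seed=0):
    """Learn a hidden space on source data by jointly minimizing
    reconstruction error and classification error (assumed linear model).

    X: (n, d) source documents, Y: (n, k) multi-label indicator matrix.
    Returns encoder W, decoder D, classifier C, and the loss history.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = Y.shape[1]
    W = 0.1 * rng.standard_normal((d, hidden_dim))  # document -> hidden
    D = 0.1 * rng.standard_normal((hidden_dim, d))  # hidden -> document
    C = 0.1 * rng.standard_normal((hidden_dim, k))  # hidden -> labels
    losses = []
    for _ in range(steps):
        H = X @ W                # hidden representation
        rec = H @ D - X          # reconstruction residual
        cls = H @ C - Y          # classification residual
        losses.append((np.sum(rec ** 2) + lam * np.sum(cls ** 2)) / n)
        # Gradients of the joint squared-error objective.
        grad_H = 2.0 * (rec @ D.T + lam * (cls @ C.T)) / n
        grad_W = X.T @ grad_H
        grad_D = 2.0 * (H.T @ rec) / n
        grad_C = 2.0 * lam * (H.T @ cls) / n
        W -= lr * grad_W
        D -= lr * grad_D
        C -= lr * grad_C
    return W, D, C, losses

def recover_and_classify(X_partial, W, D, C, threshold=0.5):
    """Map partial observations into the hidden space, then recover the
    full document and predict labels from the hidden representation."""
    H = X_partial @ W
    recovered = H @ D
    labels = (H @ C > threshold).astype(int)
    return recovered, labels

def make_synthetic(n=200, d=20, k=2, latent=3, seed=1):
    """Hypothetical low-rank synthetic data, not the paper's data."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, latent))
    A = rng.standard_normal((latent, d)) / np.sqrt(d)
    B = rng.standard_normal((latent, k))
    return Z @ A, (Z @ B > 0).astype(float)

X_src, Y_src = make_synthetic()
W, D, C, losses = fit_hidden_space(X_src, Y_src)

# Simulate partial observations by hiding about half the features of
# some held-out documents (zeros stand in for unobserved entries).
X_tgt, _ = make_synthetic(n=10, seed=2)
mask = np.random.default_rng(3).random(X_tgt.shape) < 0.5
recovered, predicted = recover_and_classify(X_tgt * mask, W, D, C)
```

Here the target documents are encoded with the map learned on the source domain alone, which mirrors the abstract's description only loosely; how ALPOS actually handles the missing entries is not stated in the abstract.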