Skeleton-based action recognition has been widely investigated because of its strong adaptability to dynamic circumstances and complicated backgrounds. To recognize different actions from skeleton sequences, it is essential to model both the human posture represented by the skeleton and how that posture changes over time. However, most existing works treat the temporal and spatial dimensions of skeleton sequences in the same way, ignoring the difference between the two, which is not an optimal way to model skeleton sequences. We argue that the posture represented by the skeleton in each frame should be modeled individually, while the movement of the entire skeleton should be captured along the temporal dimension. We therefore design a Spatial Transformer Block and a Directional Temporal Transformer Block to model skeleton sequences in the spatial and temporal dimensions, respectively. Because of factors such as occlusion, the sensor, and the raw video, the extracted skeleton data contain noise in both the temporal and spatial dimensions, which reduces the recognition capability of models. To adapt to this imperfect information, we propose a multi-task self-supervised learning method that provides confusing samples in different situations to improve the robustness of our model. Combining these designs, we propose the Spatial-Temporal Specialized Transformer (STST) and evaluate it on the SHREC, NTU-RGB+D, and Kinetics-Skeleton datasets. Extensive experiments and analysis demonstrate the improved performance of the proposed method.
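To make the spatial/temporal distinction concrete, the following is a minimal sketch, not the authors' implementation, of how one skeleton sequence could be split into two different views before being passed to separate blocks. The function names `spatial_tokens` and `temporal_tokens`, the nested-list layout (frames, joints, coordinates), and the choice of flattening each frame into a single temporal token are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List

Joint = List[float]      # coordinates of one joint, e.g. (x, y) or (x, y, z)
Frame = List[Joint]      # J joints in one frame
Sequence = List[Frame]   # T frames


def spatial_tokens(seq: Sequence) -> List[List[Joint]]:
    """Spatial view (assumed): each frame is its own group of J joint tokens,
    so the posture in a frame can be modeled independently of other frames."""
    return [[list(joint) for joint in frame] for frame in seq]


def temporal_tokens(seq: Sequence) -> List[List[float]]:
    """Temporal view (assumed): each frame becomes one token holding the
    whole skeleton, so movement of the entire body is tracked over time."""
    return [[coord for joint in frame for coord in joint] for frame in seq]


# Toy sequence with made-up values: T=2 frames, J=3 joints, C=2 coordinates.
toy = [
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.1], [1.0, 0.1], [1.0, 1.2]],
]
spatial = spatial_tokens(toy)    # 2 groups of 3 tokens, each of size 2
temporal = temporal_tokens(toy)  # 2 tokens, each of size 3 * 2 = 6
```

In this sketch the spatial view never mixes joints from different frames, while the temporal view never splits a skeleton into parts, which mirrors the separation of roles described above.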