User authentication on smartphones is the key to many applications, which must satisfy both security and convenience. We propose a novel user authentication system EchoPrint, which leverages acoustics and vision for secure and convenient user authentication, without requiring any special hardware. EchoPrint actively emits almost inaudible acoustic signals from the earpiece speaker to illuminate the user's face and authenticates the user by the unique features extracted from the echoes bouncing off the 3D facial contour. To combat changes in phone-holding poses thus echoes, a Convolutional Neural Network (CNN) is trained to extract reliable acoustic features, which are further combined with visual facial features extracted from state-of-the-art face recognition deep models to feed a binary Support Vector Machine (SVM) classifier for final authentication. Because the echo features depend on 3D facial geometries, EchoPrint is not easily spoofed by images or videos like 2D visual face recognition systems. It needs only commodity hardware, thus avoiding the extra costs of special sensors in solutions like FaceID. Experiments with 62 volunteers and non-human objects such as images, photos, and sculptures show that EchoPrint achieves 93.75% balanced accuracy and 93.50% F-score, while the average precision is 98.05% using acoustic features and basic facial landmarks. The precision is further improved to 99.96% with sophisticated visual features.