We address the problem of liveness detection in audiovisual recordings to prevent spoofing attacks on biometric authentication systems. We assume that liveness is detected from a recording of a speaker uttering a predefined phrase and that another recording of the same phrase is available a priori, a setting common in text-dependent authentication systems. We propose to measure liveness by comparing the alignments of the audio and video streams to the a priori recorded sequence using dynamic time warping. The alignments are computed in a joint feature space into which audio and video are embedded using deep convolutional neural networks. We investigate the robustness of the proposed algorithm by training and testing it on different datasets. Experimental results demonstrate that the proposed algorithm generalizes well across datasets, providing improved performance compared to competing methods.
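As a rough illustration of the comparison step described above, the following is a minimal dynamic time warping sketch, assuming the audio and video frames have already been embedded into a joint feature space as NumPy arrays; `dtw_cost` is an illustrative helper, not the authors' implementation.

```python
import numpy as np

def dtw_cost(x, y):
    """Dynamic time warping cost between two embedded sequences.

    x: (n, d) array and y: (m, d) array of frame embeddings.
    Returns the cost of the optimal monotonic alignment under
    the Euclidean frame-to-frame distance.
    """
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between frames.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Accumulated-cost table with an infinite border for the base case.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in x only
                acc[i, j - 1],      # advance in y only
                acc[i - 1, j - 1],  # advance in both
            )
    return acc[n, m]
```

A low alignment cost between the test recording and the enrolled sequence, computed separately or jointly for the audio and video embeddings, would then indicate a live, consistent utterance.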