With the recent expansion of Large Language Model (LLM) capabilities, there is new potential for improving the performance of object detection and classification tasks by taking advantage of the Vision Transformer (ViT) architecture. In this paper, we focus specifically on the problem of object detection and classification at the edge, via a heterogeneous System-on-Chip (SoC). An edge setting imposes unique constraints, most notably on the amount of available memory, a constraint that is difficult to satisfy given the considerable size of LLMs. Our exploration begins with a traditional Convolutional Neural Network (CNN) running on a small deep learning accelerator, and the issues we faced with this approach on a heterogeneous edge SoC. We then transition to a transformer-based architecture, using a ViT adapted for simultaneous object detection and classification and running on a Natural Language Processing (NLP) accelerator. In particular, we focus on increasing sparsity in the model to combat the strict memory constraints of the chip and on introducing early-exit mechanisms to minimize end-to-end latency.