Conference paper: Hierarchical attention based spatial-temporal graph-to-sequence learning for grounded video description