It seems odd that only one object of each class is tracked in a video segment. I'd expect this to be a limitation for algorithms that generate bounding boxes for every instance of a class, since they could be penalized with false positives for correctly finding the untracked instances. Is tracking only a single instance per class standard for this kind of dataset?
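To make the concern concrete, here is a minimal sketch of how the penalty could arise. It assumes a simple greedy IoU-threshold matching rule, which may not be the dataset's actual evaluation protocol; the box coordinates and the `iou` and `precision` helpers are hypothetical and only for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision(detections, annotations, thresh=0.5):
    """Greedy one-to-one matching (an assumed rule, not the dataset's).

    Any detection without a matching annotation counts as a false positive.
    """
    unmatched = list(annotations)
    true_positives = 0
    for det in detections:
        best = max(unmatched, key=lambda gt: iou(det, gt), default=None)
        if best is not None and iou(det, best) >= thresh:
            unmatched.remove(best)
            true_positives += 1
    return true_positives / len(detections) if detections else 0.0


# Hypothetical frame containing two cars; the detector finds both.
cars_in_scene = [(10, 10, 50, 50), (100, 20, 160, 70)]
detections = [(11, 9, 51, 49), (99, 21, 161, 71)]

# Scored against both cars vs. against only the single tracked car.
full_precision = precision(detections, cars_in_scene)
single_precision = precision(detections, cars_in_scene[:1])
print(full_precision, single_precision)  # 1.0 0.5
```

Under these assumptions, the same correct detections score a precision of 1.0 when both cars are annotated but only 0.5 when a single instance is tracked, because the second detection has nothing to match against.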