Not sure, could be the large number of Spanish dialects represented in the dataset, label noise, or something else. There may just be too much diversity in the class to fit neatly in a cluster.
Also, the training dataset is highly imbalanced and Spanish is the most common class, so the model predicts it as a sort of default when it isn't confident -- this could lead to artifacts in the reduced 3D space.
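To illustrate the effect (this is a toy simulation, not our model or data): give each sample a variable amount of real evidence plus a small constant prior toward the majority class, and the least-confident predictions collapse onto that class far more often than the most-confident ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 10_000, 5

# Toy stand-in for model outputs: per-sample "sharpness" controls how much
# real accent evidence a clip carries; a constant bias on class 0 mimics
# the prior a model picks up from class imbalance ("Spanish" = class 0).
sharpness = rng.uniform(0.1, 4.0, size=(n, 1))
logits = sharpness * rng.normal(size=(n, n_classes))
logits[:, 0] += 0.5  # imbalance-induced prior toward the majority class

probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
conf = probs.max(axis=1)
pred = probs.argmax(axis=1)

low = conf < np.quantile(conf, 0.25)   # least-confident quartile
high = conf > np.quantile(conf, 0.75)  # most-confident quartile

print("majority-class rate, low confidence: ", (pred[low] == 0).mean())
print("majority-class rate, high confidence:", (pred[high] == 0).mean())
```

When the evidence is weak, the prior dominates, so low-confidence clips pile up on the majority class -- and in a UMAP/t-SNE-style 3D reduction those clips can smear out as a diffuse "Spanish" region rather than a tight cluster.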
We didn't explicitly. Because we fine-tuned this model for accent classification, the later transformer layers appear to discard non-accent vocal characteristics. I verified this for gender, for example.
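The standard way to run that kind of check is a probing classifier: train a small linear probe to predict the attribute (here gender) from each layer's embeddings, and see whether accuracy stays near chance at the later layers. A minimal sketch with synthetic stand-ins for "early" and "late" layer activations (the layer names and data here are illustrative, not our actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64
y = rng.integers(0, 2, size=n)  # synthetic binary gender labels

# Synthetic activations: the "early" layer carries a gender direction,
# the "late" layer carries none (pure noise).
direction = rng.normal(size=d)
early = rng.normal(size=(n, d)) + np.outer(2 * y - 1, direction)
late = rng.normal(size=(n, d))

def probe_accuracy(X, y, lr=0.1, steps=300):
    """Train a logistic-regression probe on half the data, test on the rest."""
    half = len(X) // 2
    Xtr, ytr, Xte, yte = X[:half], y[:half], X[half:], y[half:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(Xtr @ w + b)))  # sigmoid
        g = p - ytr                           # gradient of the log loss
        w -= lr * Xtr.T @ g / len(Xtr)
        b -= lr * g.mean()
    return (((Xte @ w + b) > 0).astype(int) == yte).mean()

print("gender probe accuracy, early layer:", probe_accuracy(early, y))
print("gender probe accuracy, late layer: ", probe_accuracy(late, y))
```

High probe accuracy at a layer means the attribute is still linearly decodable there; accuracy near 50% at the later layers is the signature of the fine-tuning having pushed that information out.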
People have all sorts of motivations for learning languages and accents. Right now, I'm using this tech to work on my accent in Spanish. Honestly I would rather mumble almost unintelligibly with a decent Mexican accent than speak Spanish slowly and clearly with an American accent. There is a difficult but necessary period of learning an accent where intelligibility drops. For a while, I made a strange [ð]-like noise when learning the alveolar trill (rolled R), and it would have been more intelligible to use something like an alveolar tap. But I built up the muscle memory, and can now make the correct sound. Hearing a version of myself (rather than a different speaker) gives me a more useful target to mimic, and the distance metric gives me a useful measure of whether I'm closer to or further from the target.
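The thread doesn't pin down the exact metric, but one natural instance is cosine distance between the embedding of each practice attempt and the embedding of the target voice; tracked over sessions, it gives a single number to drive toward zero. A sketch with simulated embeddings (the dimensions and "sessions" are made up for illustration):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 2 = opposite."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
target = rng.normal(size=256)  # embedding of the accent-converted "target me"

# Simulated practice sessions: each attempt mixes in more of the target.
dists = []
for step, mix in enumerate([0.2, 0.5, 0.8]):
    attempt = mix * target + (1 - mix) * rng.normal(size=256)
    dists.append(cosine_distance(attempt, target))
    print(f"session {step}: distance to target = {dists[-1]:.3f}")
```

A shrinking distance across sessions is the feedback signal: it tells you the muscle-memory changes are actually moving your production toward the target, even during the stretch where intelligibility temporarily drops.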