I agree that the nominative case endings and usual SVO order of English should prevent (nearly all) false positives.
(The only one I can think of is that comparatives are meant to take the nominative case, e.g. pedantically it's supposed to be "he" rather than "him" in "A man smarter than he would decline". However, this rather convoluted sentence structure is rare, even more unlikely in a plot summary rather than the full novel text, and almost always followed by the subjunctive rather than an indicative verb.)
My bigger concern was what doesn't get caught because it appears after a noun rather than a pronoun, or has an adverb in the way. I thought major dramatic plot points might be more likely to use the character's name rather than a pronoun, and so we might see fewer words like "murders", "defeats", etc. From the results it seems like those words are still present in large quantities, so perhaps I'm wrong.
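To make the two kinds of miss concrete, here is a minimal sketch. It assumes the analysis simply takes the word immediately following "he" or "she"; the original method may well differ. The function name and the sample sentence are made up for illustration.

```python
import re

# Assumption: the analysis looks only at the word right after "he"/"she".
# This is a guess at the approach, not the original author's code.
PRONOUN_NEXT = re.compile(r"\b(he|she)\s+(\w+)", re.IGNORECASE)

def pronoun_following_words(text):
    """Return (pronoun, next_word) pairs, lowercased."""
    return [(p.lower(), w.lower()) for p, w in PRONOUN_NEXT.findall(text)]

# Hypothetical plot-summary text.
sample = "John murders the king. She quickly leaves town. He defeats the guard."
pairs = pronoun_following_words(sample)
print(pairs)
```

On this sample, "murders" is never seen because its subject is a name, and "quickly" is recorded in place of "leaves" because the adverb sits between the pronoun and the verb. Only "defeats" is picked up as intended.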
I'd like to see it done with "they" too, partly as a control case and partly to see if any verbs are more common for individuals rather than groups (although the rise of impersonal "they" may hinder that aspect of the analysis).
> My bigger concern was what doesn't get caught because it appears after a noun rather than a pronoun,
Yeah, some clever gender/name lookup would be a good idea, since many subjects in plots are the characters' names. There may even be more named subjects than pronouns.
That said, it seems unlikely that there's a pronoun-substitution gender bias, so it would probably just yield more samples of the same trend.
What doesn't get caught is a super interesting question. (As are object genders.) How often "he" occurs vs "she" matters, and so does how long he is talked about vs how long she is talked about. I'm not certain about screenplay conventions, but usually the first sentence would name the character and some number of subsequent sentences might use "she" rather than repeat her name.
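As a rough baseline for the "how often" question, one could count raw pronoun occurrences before looking at any verbs. This is only a sketch: the function name and sample text are hypothetical, and it deliberately ignores names, which is exactly the gap discussed above.

```python
import re
from collections import Counter

# Whole-word match, so "the" and "her" are not counted as "he".
SUBJECT_PRONOUN = re.compile(r"\b(he|she|they)\b", re.IGNORECASE)

def pronoun_counts(text):
    """Count occurrences of he/she/they, case-insensitively."""
    return Counter(m.lower() for m in SUBJECT_PRONOUN.findall(text))

# Hypothetical plot-summary text; "Anna" is a subject but is not counted.
sample = "Anna finds the map. She hides it. He follows her. They argue, and she leaves."
counts = pronoun_counts(sample)
print(counts)
```

Dividing each verb's count by the matching pronoun total would give a rate rather than a raw count, which makes the "he" and "she" lists easier to compare if one pronoun is simply more frequent overall.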
So yeah, there's plenty more room for making this analysis scientific, and no reason to assume it's unbiased.