Almost there! But too many speeches?

The numbers don't match...

After reading the archive files and writing the relevant fields to a TSV, I wanted to check whether I was on track with the metrics given in the PNAS writeup. Their data cleanup ended with a final figure of 9,930,592 words in their raw corpus. I had 11,060,994, about 1.13 million more words: a problem. They also had 44,953 total speeches, whereas I had 45,768 (815 too many). I would need to find a way to bring our numbers closer together.
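A minimal sketch of this kind of sanity check, assuming the TSV has one speech per row and a column named "text" (a hypothetical name), and that words are counted by splitting on whitespace. The PNAS authors may have tokenized differently, so this only gives a rough comparison. The sample rows are made up.

```python
import csv
import io

# Targets reported in the PNAS writeup, as quoted in this post.
TARGET_SPEECHES = 44_953
TARGET_WORDS = 9_930_592


def corpus_totals(tsv_file, text_column="text"):
    """Return (speech_count, word_count) for a TSV with one speech per row."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    speeches = 0
    words = 0
    for row in reader:
        speeches += 1
        # Assumption: a "word" is any whitespace-separated token.
        words += len(row[text_column].split())
    return speeches, words


# Tiny made-up sample standing in for the real TSV file.
sample = io.StringIO(
    "speaker\tdate\ttext\n"
    "Speaker A\t1789-06-23\tNous sommes ici par la volonté du peuple\n"
    "\t1789-06-23\tLa séance est levée\n"
)
speeches, words = corpus_totals(sample)
print(speeches, words)

# Gaps for the counts quoted in this post.
speech_gap = 45_768 - TARGET_SPEECHES
word_gap = 11_060_994 - TARGET_WORDS
print(speech_gap, word_gap)
```

For the real data, the same function could be called on a file opened with open(path, encoding="utf-8", newline="").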

My first thought was to remove all speeches that had no speaker (the speaker tag was empty or n/a). This removed only 54 speeches, which was not enough. Then I tried removing entire sessions where the President = ANNEXE. This removed far too many speeches, leaving me significantly below the target number, and that was without even counting other spellings of annexe or cases where annexe was in lowercase. Then I tried removing all speeches dated September 1791, in case the date range did not include this last month. That left me 1,476 speeches too few, a worse place than before, when I was only 815 speeches off.
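The three candidate filters can be compared side by side before committing to any of them. This is only a sketch: the column names ("speaker", "president", "date"), the ISO date format, and the sample rows are all assumptions made for illustration, not the layout of my actual TSV.

```python
def is_missing_speaker(row):
    """True when the speaker field is empty or 'n/a'."""
    return row["speaker"].strip().lower() in {"", "n/a"}


def is_annexe_session(row):
    """True when the president field starts with 'annexe', in any case.

    The prefix match also catches variants such as 'Annexes'.
    """
    return row["president"].strip().lower().startswith("annexe")


def is_sept_1791(row):
    """True when the date falls in September 1791 (assumes YYYY-MM-DD)."""
    return row["date"].startswith("1791-09")


def removed_by(rows, predicate):
    """Count how many rows a filter would remove, without removing them."""
    return sum(1 for row in rows if predicate(row))


# Made-up rows standing in for the real speeches.
rows = [
    {"speaker": "A", "president": "B", "date": "1790-01-04"},
    {"speaker": "", "president": "B", "date": "1790-01-04"},
    {"speaker": "n/a", "president": "ANNEXE", "date": "1791-09-02"},
    {"speaker": "C", "president": "annexes", "date": "1791-08-30"},
    {"speaker": "D", "president": "B", "date": "1791-08-31"},
]

counts = {
    "missing_speaker": removed_by(rows, is_missing_speaker),
    "annexe_session": removed_by(rows, is_annexe_session),
    "sept_1791": removed_by(rows, is_sept_1791),
}
print(counts)
```

Counting first, rather than deleting, makes it easy to see whether a filter overshoots the 815-speech gap before any data is thrown away.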

I ultimately went back to the speeches script. After a conversation with my mentor Laure, we determined that it might be a BeautifulSoup issue, with the script iterating over too many speeches. So we added "recursive=False" when finding all speech tags, which limits the search to direct children of the tag being searched, and this eliminated roughly 600 speeches. We were most comfortable with this modification and decided to keep it.
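To illustrate why a recursive search can overcount, here is a small sketch. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages: root.iter("speech") behaves like a default (recursive) find_all, and root.findall("speech") behaves like find_all("speech", recursive=False). The nested speech tag is a hypothetical example of how double counting could arise; I have not confirmed that this is exactly what the archive files contain.

```python
import xml.etree.ElementTree as ET

# Made-up session with one speech nested inside another.
doc = """
<session>
  <speech id="1">First speech</speech>
  <speech id="2">Outer speech
    <speech id="2a">Nested speech</speech>
  </speech>
  <speech id="3">Third speech</speech>
</session>
""".strip()

root = ET.fromstring(doc)

# Every <speech> at any depth, like a recursive search.
all_descendants = list(root.iter("speech"))

# Only <speech> tags that are direct children of <session>.
direct_children = root.findall("speech")

print(len(all_descendants), len(direct_children))
```

In this sample the recursive search finds one more speech than the direct-children search, because the nested tag is counted as a speech in its own right.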

Now that the raw text vocabulary is mostly sorted out, I will write a cleaning script to make all raw text lowercase, remove words shorter than three characters, separate contractions, and remove words in the stopwords list of 48 French words ('les', 'que', 'des', 'qui', 'est', 'vous', 'dans', 'pour', 'une', 'pas', 'par', 'sur', 'nous', 'cette', 'aux', 'mais', 'ils', 'leur', 'être', 'sont', 'ces', 'ont', 'elle', 'tous', 'avec', 'faire', 'son', 'ses', 'dont', 'comme', 'votre', 'soit', 'lui', 'peut', 'leurs', 'donc', 'avez', 'doit', 'faut', 'sera', 'était', 'vos', 'ceux', 'avoir', 'cet', 'nos', 'ainsi', 'avait'). I will describe this process in my next blog post.
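A first rough sketch of what that cleaning step might look like. The order of operations is an assumption on my part: contractions are split on apostrophes before the length filter, so that a fragment like "l" from "l'assemblée" is dropped as a short word. The function name clean and the tokenizing regex are placeholders, not the final script.

```python
import re

# The 48 stopwords listed above.
STOPWORDS = {
    "les", "que", "des", "qui", "est", "vous", "dans", "pour", "une", "pas",
    "par", "sur", "nous", "cette", "aux", "mais", "ils", "leur", "être",
    "sont", "ces", "ont", "elle", "tous", "avec", "faire", "son", "ses",
    "dont", "comme", "votre", "soit", "lui", "peut", "leurs", "donc", "avez",
    "doit", "faut", "sera", "était", "vos", "ceux", "avoir", "cet", "nos",
    "ainsi", "avait",
}


def clean(text):
    """Lowercase, split contractions, drop short words and stopwords."""
    text = text.lower()
    # Separate contractions: treat straight and curly apostrophes as spaces.
    text = re.sub(r"['’]", " ", text)
    # Keep runs of letters only (accented letters included); this also
    # discards punctuation and digits.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]


example = clean("L'Assemblée nationale a décrété que les citoyens sont libres.")
print(example)
```

One side effect of this tokenizer is that hyphenated words are split at the hyphen, so "peut-être" would become "peut" and "être" and both would be removed as stopwords.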