A Deeper Dive
And the coding begins...
To start building the corpus of treated text, we first need to look at our data. The files can be found in the Stanford University Library repository in their French Revolution Digital Archive. On this site, each page is available for download as a scanned JPEG image, a PDF, a raw text file, or an XML file. At this point, I had to decide how to save the site's data locally to my machine. Images and PDFs would not be useful, but should I save the archives as text files or as XML? XML was the better choice, since it was already structured into a parseable format.
To begin, I downloaded the data from the Stanford University Library repository. There are 101 tomes of data in the Archives Parlementaires, but upon further inspection it became clear that only tomes 8-31 were relevant to this project: the earlier tomes focus on the États généraux rather than the Assemblée Nationale, and every tome after tome 31 falls beyond our timeline of September 1789 to September 1791. Therefore, tomes 8-31 were downloaded into an archives folder as TEI XML files so that they maintained their nested and tagged structure.
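Downloading 24 files by hand works, but a short script keeps the step reproducible. Here is a minimal sketch, assuming a hypothetical URL pattern; the placeholder BASE_URL would need to be replaced with the actual download links from the FRDA site.

import os
import requests

# Placeholder pattern, NOT the real FRDA endpoint; fill in from the site itself.
BASE_URL = "https://example.org/frda/tome_{n}.xml"
OUT_DIR = "archives"
os.makedirs(OUT_DIR, exist_ok=True)

for n in range(8, 32):  # tomes 8-31, inclusive
    resp = requests.get(BASE_URL.format(n=n), timeout=60)
    resp.raise_for_status()
    # Save the TEI XML exactly as served, preserving its tagged structure.
    with open(os.path.join(OUT_DIR, f"tome_{n}.xml"), "wb") as f:
        f.write(resp.content)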
Here is a deconstructed, general view of the XML nest structure of an archived file within a tome. Once we understand what is going on in the tree, we can use Beautiful Soup to pull data out from within the tags. This way, we can separate speeches nested within sessions and then pull out associated information, like the date of the speech, the speaker, and the raw text of what that speaker said. A small sketch after the listing shows this in action.
#TOME X
<text>
  <body>
    <div1 type="volume" n="8">
      #SESSION
      <div2 type="session">
        #PARLIAMENT & LEADER
        <head>ASSEMBLÉE NATIONALE.</head>
        <head>présidence de m. bailly.</head>
        #DATE OF SESSION
        <p>Séance du <date value="1789-06-30">mardi 30 juin 1789</date></p>
        #END OF DATE
        #SPEECH
        <sp>
          #SPEAKER
          <speaker>M. le Président</speaker>
          #RAW TEXT OF SPEECH
          <p> ... </p>
          <p> ... </p>
          <p> ... </p>
          <p> ... </p>
          #END OF SPEECH
        </sp>
        #END OF SESSION
      </div2>
      #END OF TOME
    </div1>
  </body>
</text>
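Once a file is loaded, Beautiful Soup can reach into these tags directly. This is a minimal sketch, assuming the snippet above is saved as sample.xml (a placeholder name) and that the lxml parser is installed:

from bs4 import BeautifulSoup

# "lxml-xml" tells Beautiful Soup to treat the file as XML, not HTML.
with open("sample.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml-xml")

# Sessions are the div2 elements with type="session".
session = soup.find("div2", attrs={"type": "session"})

# The machine-readable date lives in the value attribute of the date tag.
print(session.find("date")["value"])  # 1789-06-30

# Each speech is an sp tag; its speaker tag holds the speaker's name.
print(session.find("sp").find("speaker").get_text(strip=True))  # M. le Président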
As the annotated tree and the sketch above show, corresponding tags are nested within each other. After looking at this structure, I wrote a test script against just one of the 24 total tomes: I used Beautiful Soup to parse the XML and then wrote the scraped data into a TSV file.
Generally, to get to the raw speech text, we first have to find every div2 with type "session", since speeches are only located within sessions. Then we find every sp tag within that div2 and concatenate the text of all of its p tags together. In my next blog post, I will go over exactly how I used Beautiful Soup to do this, but the sketch below gives a rough preview.
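This is a minimal sketch of that traversal, not the final script; the input and output file names are placeholders, and the column choices are my own assumption.

import csv
from bs4 import BeautifulSoup

# Placeholder file name; the real tome files live in the archives folder.
with open("archives/tome_8.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml-xml")

with open("tome_8_speeches.tsv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["date", "speaker", "text"])
    # Speeches only occur inside sessions, so start from the session div2s.
    for session in soup.find_all("div2", attrs={"type": "session"}):
        date_tag = session.find("date")
        date = date_tag["value"] if date_tag and date_tag.has_attr("value") else ""
        for sp in session.find_all("sp"):
            speaker_tag = sp.find("speaker")
            speaker = speaker_tag.get_text(strip=True) if speaker_tag else ""
            # Concatenate every p tag inside the speech into one raw-text field.
            text = " ".join(p.get_text(" ", strip=True) for p in sp.find_all("p"))
            writer.writerow([date, speaker, text])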