Writing to a TSV

Read BeautifulSoup data into columns

Once I got the BeautifulSoup code running, I could write what I was reading from the archives directly into a saved TSV file. I originally thought to iterate through every filename with the line "with filename as reader", but this was cumbersome and ineffective. Instead, I opened a TSV in write mode and then iterated through every archive file, still using basepath, but saving each file as a reader object so that I wouldn't have to manage a nested "with" block for every open file.
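To illustrate the difference, here is a minimal sketch of the two approaches, with nothing specific to the archives in it:

# Cumbersome: a nested "with" block for every archive file adds a level
# of indentation around all the parsing and writing inside the loop.
import os

basepath = "archives/"
for filename in os.listdir(basepath):
	with open(os.path.join(basepath, filename), encoding="utf-8-sig") as reader:
		pass  # parse and write inside the nested block

# Simpler here: open each archive as a plain reader object, parse it,
# and close it explicitly before moving on to the next file.
for filename in os.listdir(basepath):
	reader = open(os.path.join(basepath, filename), encoding="utf-8-sig")
	# parse and write here
	reader.close()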

I wrote a header line to the TSV file with the fields tome, speech_id, session_id, date, speaker, and raw_text. Then, after using BeautifulSoup to pull out the requisite information, I saved it all to the TSV I had opened in write mode with the line: speeches_writer.write('{}\t{}\t{}\t{}\t{}\t{}\n'.format(tome, speech_id, session_id, date, speaker, raw_text)). Then I closed the writer. After running this script, I got a 72MB TSV file with all the data neatly arranged by column. Here is a version of the script I wrote to get the first real set of workable data.

from bs4 import BeautifulSoup
import os


# Open the output TSV in write mode and write the header line.
speeches_writer = open('raw_output/all_speeches.tsv', mode='w')
speeches_writer.write("tome\tspeech_id\tsession_id\tdate\tspeaker\traw_text\n")


basepath = "archives/"
for filename in os.listdir(basepath):
	if not filename.endswith('.xml'):
		continue
	print('Reading {}...'.format(filename))
	path = os.path.join(basepath, filename)

	# Open each archive as a reader object and parse it with the XML parser.
	reader = open(path, encoding="utf-8-sig")
	soup = BeautifulSoup(reader, "xml")
	sessions = soup.find_all("div2", type="session")

	# Infer the tome number from the filename length (one- vs. two-digit tomes).
	if len(filename) == 9:
		tome = filename[4]
	elif len(filename) == 10:
		tome = filename[4:6]

	speech_id = 0
	session_id = 0
	for session in sessions:
		session_id += 1
		date = session.find("date")["value"]
		speeches = session.find_all("sp")
		for speech in speeches:
			speech_id += 1
			# The speaker tag is sometimes missing; fall back to 'n/a'.
			speaker_tag = speech.find("speaker")
			if speaker_tag:
				speaker = speaker_tag.text
				speaker = speaker.replace('\n', ' ')
			else:
				speaker = 'n/a'
			# Join all paragraphs of the speech and collapse whitespace
			# so stray tabs and newlines can't break the TSV columns.
			p = speech.find_all("p")
			para_texts = [para.text for para in p]
			raw_text = ' '.join(para_texts)
			raw_text = raw_text.replace('\t', ' ')
			raw_text = raw_text.replace('\n', ' ')
			lines = raw_text.split()
			raw_text = ' '.join(lines)

			speeches_writer.write('{}\t{}\t{}\t{}\t{}\t{}\n'.format(tome, speech_id, session_id, date, speaker, raw_text))

	reader.close()

speeches_writer.close()

I also wrote a similar script to go through all the sessions and pull out the interesting information related exclusively to sessions, like tome, session id, organization (e.g. Assemblée Nationale), president, and date of session. This information was read and written with my script total_corpus_session.py and saved to all_sessions.tsv in my raw_output folder.
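The session-level script follows the same pattern as the speech script. Here is a minimal sketch of what it might look like; the "organization" and "president" tag names are assumptions standing in for whatever the archives actually use, while the session loop, tome logic, and TSV output mirror the script above:

from bs4 import BeautifulSoup
import os

# Open the session-level TSV and write its header line.
sessions_writer = open('raw_output/all_sessions.tsv', mode='w')
sessions_writer.write("tome\tsession_id\torganization\tpresident\tdate\n")

basepath = "archives/"
for filename in os.listdir(basepath):
	if not filename.endswith('.xml'):
		continue
	path = os.path.join(basepath, filename)

	reader = open(path, encoding="utf-8-sig")
	soup = BeautifulSoup(reader, "xml")
	sessions = soup.find_all("div2", type="session")

	# Same tome logic as in the speech script.
	if len(filename) == 9:
		tome = filename[4]
	elif len(filename) == 10:
		tome = filename[4:6]

	session_id = 0
	for session in sessions:
		session_id += 1
		date = session.find("date")["value"]

		# Hypothetical tag names: adjust these to the actual markup.
		org_tag = session.find("organization")
		organization = org_tag.text.strip() if org_tag else 'n/a'
		pres_tag = session.find("president")
		president = pres_tag.text.strip() if pres_tag else 'n/a'

		sessions_writer.write('{}\t{}\t{}\t{}\t{}\n'.format(
			tome, session_id, organization, president, date))

	reader.close()

sessions_writer.close()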