How to extract comments from a Word (.docx) file in Python


#!/usr/bin/env python
# Given a .docx file, extract a CSV list of all tagged (commented) text
# This is version 6.0 of the script
# Date: 12 February 2020

import zipfile
import csv
from bs4 import BeautifulSoup as Soup
import tkinter as tk
from tkinter import filedialog
import re

# Show file selection dialog box
root = tk.Tk()
root.withdraw()
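paths = filedialog.askopenfilenames()
root.update()
# Stop early if the dialog was cancelled; paths[0] below would otherwise raise IndexError
if not paths:
	raise SystemExit('No files selected')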

with open('/'.join(paths[0].split('/')[0:-1])+'/output.csv', 'w', newline='', encoding='utf-8-sig') as f:
	csvw = csv.writer(f)
	# loop through each selected file
	for path in paths:
		# Write a header line with the filename
		csvw.writerow([path.split('/')[-1], ''])
		# .docx files are really ZIP files with a separate 'file' within them for the document
		# itself and the text of the comments. This unzips the file and parses the comments.xml
		# file within it, which contains the comment (label) text
		unzip = zipfile.ZipFile(path)
		comments = Soup(unzip.read('word/comments.xml'), 'lxml')
		# The structure of the document itself is more complex and we need to do some
		# preprocessing to handle multi-paragraph and nested comments, so we unzip
		# it into a string first
		doc = unzip.read('word/document.xml').decode()
		# Find all the comment start and end locations and store them in dictionaries
		# keyed on the unique ID for each comment
		start_loc = {x.group(1): x.start() for x in re.finditer(r'<w:commentRangeStart.*?w:id="(.*?)"', doc)}
		end_loc = {x.group(1): x.end() for x in re.finditer(r'<w:commentRangeEnd.*?w:id="(.*?)".*?>', doc)}
		# loop through all the comments in the comments.xml file
		for c in comments.find_all('w:comment'):
			c_id = c.attrs['w:id']
			# Skip comments that have no anchored range in the document body
			if c_id not in start_loc or c_id not in end_loc:
				continue
			# Use the locations we found earlier to extract the xml fragment from the document for
			# each comment ID, adding spaces to separate any paragraphs in multi-paragraph comments
			xml = re.sub(r'(<w:p .*?>)', r'\1 ', doc[start_loc[c_id]:end_loc[c_id] + 1])
			# Parse the XML fragment, extract any text and write to file along with the label text
			csvw.writerow([''.join(c.find_all(string=True)), ''.join(Soup(xml, 'lxml').find_all(string=True))])
		# Close the archive once all comments for this file have been written
		unzip.close()
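
If only the comment text is needed, and not the document text each comment is attached to, the same comments.xml part can be read with the standard library alone. The following is a minimal sketch, not part of the script above: the function name `read_comments` and the shape of its return value are assumptions made for illustration. It relies on the standard WordprocessingML namespace and returns an empty list when the file has no comments part.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by w:comment, w:p and w:t elements
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_comments(docx):
    """Return a list of (comment_id, author, text) tuples from a .docx.

    `docx` may be a file path or a binary file-like object.
    Hypothetical helper for illustration; returns [] if the archive
    has no word/comments.xml part.
    """
    with zipfile.ZipFile(docx) as zf:
        if "word/comments.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("word/comments.xml"))

    results = []
    for comment in root.iter(f"{{{W_NS}}}comment"):
        comment_id = comment.get(f"{{{W_NS}}}id")
        author = comment.get(f"{{{W_NS}}}author", "")
        # Join the text runs of each paragraph, then separate paragraphs with a space
        paragraphs = [
            "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
            for p in comment.iter(f"{{{W_NS}}}p")
        ]
        results.append((comment_id, author, " ".join(paragraphs)))
    return results
```

Calling `read_comments('report.docx')` (a placeholder file name) would give one tuple per comment, which can be passed straight to `csv.writer.writerows`. Using a real XML parser here avoids the regular-expression matching on raw XML, at the cost of not recovering the commented document text.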
