Parsing .docx documents with Python

In this chapter, we will learn how to parse a .docx file with Python. A dedicated package, 'python-docx', exists for this. So, let's start.

First, run this command to install python-docx:

pip install python-docx

Now, suppose you have a .docx file that contains headings, paragraphs, tables, images, bullets, etc. In this chapter, we will extract each of these parts from the .docx file using Python.

For example, say the file is named 'test.docx':

  from docx import Document

  # Open the existing file; reading it does not require saving it again.
  document = Document('test.docx')

Now we can access the contents of 'test.docx' from Python.

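1. Paragraph parsing:

  for para in document.paragraphs:
    print(para.text)

This prints the text of every paragraph in the body of 'test.docx'. Headings and list items are paragraphs too, so they are included. Paragraphs inside table cells are not; those are covered in the next section.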
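If we want the paragraph text as data rather than printed output, the loop above can be wrapped in a small helper. This is only a sketch: the function name `paragraph_texts` is our own and not part of python-docx, and skipping whitespace-only paragraphs is an assumption about what is useful.

```python
def paragraph_texts(document):
    """Return the text of each non-empty body paragraph, in document order."""
    texts = []
    for para in document.paragraphs:
        text = para.text.strip()
        if text:  # skip empty paragraphs that are only used for spacing
            texts.append(text)
    return texts
```

With the document opened earlier, `paragraph_texts(document)` returns a plain list of strings that can be searched, counted, or written to another file.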

2. Table parsing:

  for table in document.tables:
    for row in table.rows:
      for cell in row.cells:
        for para in cell.paragraphs:
          print(para.text)

This prints the content of every cell in every table of the document, one paragraph at a time.
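Printing cell by cell loses the row and column layout. A minimal sketch that keeps it, assuming we only need the text of each cell, could look like this. The helper name `table_rows` is hypothetical; `cell.text` is the python-docx shortcut that joins the paragraphs of a cell with newlines.

```python
def table_rows(document):
    """Return every table as a list of rows, each row a list of cell strings."""
    tables = []
    for table in document.tables:
        rows = []
        for row in table.rows:
            # cell.text joins all paragraphs of the cell with newlines
            rows.append([cell.text.strip() for cell in row.cells])
        tables.append(rows)
    return tables
```

Calling `table_rows(document)` gives one nested list per table, so `table_rows(document)[0][0]` would be the first row of the first table.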

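3. Image parsing:

  for image in document.inline_shapes:
    print(image.width, image.height)

This prints the width and height of each inline image (a picture placed in line with the text) in the file. The values are in English Metric Units (EMU); 914400 EMU equal one inch.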
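Raw EMU values are hard to read, so it can help to convert them. The sketch below does the conversion by hand to make the unit explicit; the helper name `image_sizes_in_inches` and the rounding to two decimals are our own choices, not something python-docx prescribes.

```python
EMU_PER_INCH = 914400  # English Metric Units in one inch


def image_sizes_in_inches(document):
    """Return (width, height) in inches for each inline shape."""
    sizes = []
    for shape in document.inline_shapes:
        # width and height are integer-like lengths measured in EMU
        width = int(shape.width) / EMU_PER_INCH
        height = int(shape.height) / EMU_PER_INCH
        sizes.append((round(width, 2), round(height, 2)))
    return sizes
```

The length objects returned by python-docx also expose ready-made conversions such as `image.width.inches`, which can be used instead of dividing manually.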

4. Heading parsing:

  for content in document.paragraphs:
    if content.style.name in ('Heading 1', 'Heading 2', 'Heading 3'):
      print(content.text)

This prints the text of every paragraph whose style is 'Heading 1', 'Heading 2' or 'Heading 3'.
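The same style check can also record which level each heading has, which is handy for building an outline. This is a rough sketch under the assumption that the file uses the built-in 'Heading N' styles; the function name `heading_outline` and its `max_level` parameter are made up for illustration.

```python
def heading_outline(document, max_level=3):
    """Return (level, text) pairs for paragraphs styled 'Heading 1' to 'Heading N'."""
    # Map each style name we care about to its numeric level
    wanted = {'Heading %d' % n: n for n in range(1, max_level + 1)}
    outline = []
    for para in document.paragraphs:
        level = wanted.get(para.style.name)
        if level is not None:
            outline.append((level, para.text))
    return outline
```

For example, `heading_outline(document)` returns the headings in document order, and the level can be used to indent them when printing a table of contents.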
