How-To Geek

Easily Extract Images, Text and Embedded Files from an Office 2007/2010 Document

Microsoft Office 2007 introduced a new XML-based file format for the Office suite of products. Word files, for example, use the “.doc” extension in Office 2003 and earlier and “.docx” in Office 2007 and later. Most likely, none of this is new to you.

One thing you may not know, however, is that the new XML-based file formats are actually compressed files which you can open with a zip client. In this article, we dig into the inner contents of a Word 2007 file using 7-Zip to extract images, text and embedded files.

Viewing the Internal Contents of an Office 2007 Document

Consider the document below.

image

Right-click the document and select Open archive from the 7-Zip context menu.

image

The archive contains a folder structure and XML files which hold all the data used to render the Word document.

image
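
If you would rather inspect the archive from a script than from 7-Zip, the same listing can be sketched with Python's standard zipfile module. This is only an illustration: the function name list_parts is ours, and the path in the usage note is a placeholder.

```python
import zipfile


def list_parts(docx_path):
    """Return (entry name, uncompressed size in bytes) for every entry in the archive."""
    with zipfile.ZipFile(docx_path) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]
```

Calling list_parts("example.docx") on a Word 2007 file should return entries such as word/document.xml alongside the other folders visible in the 7-Zip window.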

Extracting Images from an Office 2007 Document

You can view all the embedded images inside of the “word\media” folder.

image

These image files can be extracted from the document the same way you would extract files from a standard zip file. For example, you can drag and drop the entire “media” folder to your desktop to extract all the images in the document.

image

The extracted files are the original images used by the document. Inside the document, there may be resizing or other properties set, but the extracted files are the raw images without these properties applied.

image
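
The drag and drop step can also be scripted, which helps when there are many documents to process. The sketch below assumes the images live under word/media/ as described above; extract_media and the output folder are names we made up for the example.

```python
import zipfile
from pathlib import Path


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ into out_dir and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            # Skip directory entries; keep only real files in the media folder.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(package.read(name))
                written.append(target)
    return written
```

As with the manual approach, the files written are the raw images, without any resizing or cropping applied inside the document.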

Extracting Text from an Office 2007 Document

The text you see in the Word document comes from the file “word\document.xml” inside of the inner contents. By opening this file in an XML viewer such as XML Notepad 2007, you can see all of the copy in plain text regardless of the style and/or formatting applied in the document itself.

image
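
For a rough plain text dump without an XML viewer, you can read the same file with Python's standard library. This sketch assumes the usual WordprocessingML layout, where paragraphs are w:p elements and text runs are w:t elements; extract_text is our own name, and text inside nested structures such as text boxes may come out duplicated or out of order.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text(docx_path):
    """Return the document text, one line per paragraph, with formatting ignored."""
    with zipfile.ZipFile(docx_path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is split across runs; join the pieces back together.
        paragraphs.append("".join(t.text or "" for t in paragraph.iter(W + "t")))
    return "\n".join(paragraphs)
```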

Extracting Embedded Files from an Office 2007 Document

Extracting embedded OLE objects and/or attached files does not work as seamlessly. While you can find and extract the respective resources, they all have a “.bin” extension, leaving it up to you to figure out the correct file type. Typically, you can work this out by trial and error, using the file names shown on the embedded objects' icons in the document as a guide.

Consider this Word document:

image

The names of these embedded files inside the archive do not help in determining which is which.

image

By systematically trying the extensions suggested by the captions of the embedded files (in our example, .mp3 and .pdf), you can figure out which extension goes with which file.

image

When the file extension is assigned correctly, it should open in the respective program.

image
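
As a possible shortcut to guessing, you can peek at the first few bytes of each “.bin” file and compare them against well-known file signatures. Treat this as a sketch under assumptions: the word/embeddings/ folder name and the short signature list are ours, and an embedded object may be wrapped in an OLE container, in which case the leading bytes identify only the wrapper and not the file inside.

```python
import zipfile

# Leading bytes of a few common formats. This list is illustrative, not exhaustive.
SIGNATURES = [
    (b"%PDF", "pdf"),
    (b"ID3", "mp3 with an ID3v2 tag"),
    (b"PK\x03\x04", "zip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole-compound-file"),
]


def sniff(data):
    """Return a label for the first matching signature, or "unknown"."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_embeddings(docx_path):
    """Map each file under word/embeddings/ (assumed location) to a guessed type."""
    results = {}
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                with package.open(name) as embedded:
                    results[name] = sniff(embedded.read(8))
    return results
```

A result of "unknown" or "ole-compound-file" simply means you are back to the guess and check approach described above.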

Conclusion

As you can see, this process is pretty simple. Using the same methodology, you do not have to stop at just images and text. You can also view style information, printer setup and any other properties specific to the document.

While the example illustrated above covers Word documents, you can just as easily extract the same information from other Office 2007 format documents such as Excel and PowerPoint. The data is stored under logical names and locations within the inner contents of the document, but if nothing else, you can always extract resource files you are unsure of and see what they contain.
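
To illustrate how the same idea carries over to Excel and PowerPoint files, the sketch below looks for a media folder anywhere in the archive instead of hard-coding the Word path. It assumes the other formats keep images in similarly named folders (for example xl/media/ and ppt/media/); list_media is a name chosen for this example.

```python
import zipfile


def list_media(office_path):
    """Return the names of all files stored in any media folder of the archive."""
    with zipfile.ZipFile(office_path) as package:
        return [
            name
            for name in package.namelist()
            if "/media/" in name and not name.endswith("/")
        ]
```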

Links

Download 7-Zip

Download XML Notepad 2007

Office 2007 File Format Details

Jason Faulkner is a developer and IT professional who never has a hot cup of coffee far away. Interact with him on Google+

  • Published 07/16/10

Comments (3)

  1. JWLapworth

    I wish I had seen this article earlier…

    After several failed attempts to extract an image for which I had “misplaced” the original metafile, this post finally enabled me to recover the file. This has saved my day!

  2. Patrick J Levy

    Great! Fantastic!

    I just have to process 500 images in more several word documents and I had no idea how to extract it; Thanks very much!!

  3. hi, about a way to extract images and rename them by caption?
