Microsoft Office 2007 introduced a new XML-based file format for the Office suite of products. Word files, for example, use the “.doc” extension in Office 2003 and earlier and “.docx” in Office 2007. Most likely, none of this is new to you.
One thing you may not know, however, is that the new XML-based file formats are actually compressed files which you can open using a zip client. For this article, we are going to dig into the inner contents of a Word 2007 file using 7-Zip to extract images, text and embedded files.
Viewing the Internal Contents of an Office 2007 Document
Consider the document below.
Right-click the document and select Open archive from the 7-Zip context menu.
The document contains a folder structure and XML files which contain all the data used to render the respective Word file.
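If you would rather inspect the archive from a script than from 7-Zip, the same folder structure can be listed with Python's standard zipfile module. This is only a minimal sketch: the function name list_document_parts and the file name in the usage comment are made up for illustration.

```python
import zipfile


def list_document_parts(path):
    """Return the names of all entries stored inside an Office 2007 document."""
    # A .docx file is an ordinary zip archive, so zipfile can open it directly.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Hypothetical usage (example.docx is a placeholder file name):
# for name in list_document_parts("example.docx"):
#     print(name)
```

Note that entry names inside the archive use forward slashes (for example word/document.xml), even though 7-Zip on Windows displays them as folders.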
Extracting Images from an Office 2007 Document
You can view all the embedded images inside the “word\media” folder.
These image files can be extracted from the document the same way you would extract files from a standard zip file. For example, you can drag and drop the entire “media” folder to your desktop to extract all the images in the document.
The extracted files are the original images used by the document. Inside the document, resizing or other properties may have been applied, but the extracted files are the raw images without these properties.
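The drag-and-drop step can also be scripted. The sketch below assumes the images live under word/media/ inside the archive, as they do in a Word document. The function name extract_media and the paths in the usage comment are hypothetical.

```python
import os
import zipfile


def extract_media(path, dest_dir):
    """Copy every file under word/media/ out of the document into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside the media folder.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(dest_dir, os.path.basename(name))
                with open(target, "wb") as out:
                    out.write(archive.read(name))
                saved.append(target)
    return saved


# Hypothetical usage (both names are placeholders):
# extract_media("example.docx", "extracted_images")
```

The bytes are written out unchanged, so what you get is the raw image, without any resizing the document applies when it is displayed.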
Extracting Text from an Office 2007 Document
The text you see in the Word document comes from the file “word\document.xml” inside the inner contents. By opening this file in an XML viewer such as XML Notepad 2007, you can see all of the copy in plain text, regardless of the style or formatting applied in the document itself.
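As a rough illustration of pulling the copy out without an XML viewer, the sketch below reads word/document.xml and joins the text runs of each paragraph. It assumes the standard WordprocessingML namespace and treats every w:p element as one line of output. The function name extract_text is made up for this example, and documents with nested content such as text boxes may need more careful handling.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the w: prefix in WordprocessingML files.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text(path):
    """Return the plain text of a Word 2007 document, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:t element holds a run of text; formatting lives in sibling elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


# Hypothetical usage (example.docx is a placeholder file name):
# print(extract_text("example.docx"))
```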
Extracting Embedded Files from an Office 2007 Document
Extracting embedded OLE objects and attached files does not work as seamlessly. While you can find and extract the respective resources, they all have a “.bin” extension, leaving it up to you to figure out the correct file type. Typically you can work this out by trial and error, using the file names displayed on the embedded objects in the document as a guide.
Consider this Word document:
The respective names of these embedded files do not help in determining which is which.
By systematically guessing and checking the extension, using the captions of the embedded files as candidates (in our example .mp3 and .pdf), you can figure out which extension goes with which file.
When the file extension is assigned correctly, it should open in the respective program.
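The guess-and-check step can be made a little less tedious by creating one copy of the extracted “.bin” file per candidate extension and then trying each copy in turn. This is only a sketch of that manual process. The function name make_candidates and the file name in the usage comment are hypothetical, and the script does not detect the file type for you.

```python
import os
import shutil


def make_candidates(bin_path, extensions):
    """Copy an extracted .bin file once per candidate extension."""
    base, _ = os.path.splitext(bin_path)
    created = []
    for ext in extensions:
        # Accept both "pdf" and ".pdf" as input.
        candidate = base + "." + ext.lstrip(".")
        shutil.copyfile(bin_path, candidate)
        created.append(candidate)
    return created


# Hypothetical usage, with the extensions taken from the captions in the document:
# make_candidates("oleObject1.bin", ["mp3", "pdf"])
```

The original “.bin” file is left untouched, so you can delete the copies that fail to open and keep the one that works.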
As you can see, this process is pretty simple. Using the same methodology, you do not have to stop at just images and text. You can also view style information, printer setup and any other properties specific to the document.
While the example above covers Word documents, you can just as easily extract the same information from other Office 2007 format documents such as Excel and PowerPoint files. The data is generally stored under logically named folders and files within the inner contents of the document, but if nothing else you can always extract resource files you are unsure of and see what they contain.
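When you are exploring an unfamiliar Excel or PowerPoint file, grouping the archive entries by their top-level folder gives a quick overview of where things are stored. The sketch below is one way to do that. The function name group_by_folder and the file name in the usage comment are made up for illustration.

```python
import zipfile
from collections import defaultdict


def group_by_folder(path):
    """Group the entries of an Office 2007 document by top-level folder."""
    groups = defaultdict(list)
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            # Entries at the root of the archive are grouped under "".
            top = name.split("/", 1)[0] if "/" in name else ""
            groups[top].append(name)
    return dict(groups)


# Hypothetical usage (example.xlsx is a placeholder file name):
# for folder, names in group_by_folder("example.xlsx").items():
#     print(folder or "(root)", len(names))
```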
- Published 07/16/10