How-To Geek

How Can I Copy Text from a PDF while Preserving the Formatting?

PDF, the ubiquitous document format, is great for sharing documents while preserving fonts, images, and the general layout across platforms. Is there an easy way, however, to preserve that very formatting when copying and pasting text out of the document?

Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.

The Question

SuperUser reader Colen is searching for a way to extract text from PDFs while preserving the formatting:

When I copy text out of a PDF file and into a text editor, it ends up mangled in a variety of ways. Formatting like bold and italics is lost; soft line breaks within a paragraph of text are converted to hard line breaks; dashes to break a word over two lines are preserved even when they shouldn’t be; and single and double quotes are replaced with ? signs.

Ideally, I’d like to be able to copy text from a PDF and have formatting converted to HTML codes, “smart quotes” converted to " and ', and line breaks done properly. Is there any way to do this?

Is there a quick and easy way for Colen (and the rest of us) to grab text without sacrificing the formatting?

The Answer

SuperUser contributor Frabjous offers a solution combined with a heavy dose of caution:

Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. A PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks for paragraph endings.

(A few recent PDFs do store some information about this stuff, but that’s a new technology, and you’d be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)

Anyway, it’s up to your software to implement some kind of “artificial intelligence” to extract merely from the locations of individual characters what is a word, what is a paragraph, and so on. Different software is going to do this better than others, and it’s also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.
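To make the "map of character locations" idea concrete, here is a minimal Python sketch of the kind of guesswork an extractor has to do. It is an illustration only, not how any particular PDF viewer works: the `Glyph` record, the `lay_out` helper, and the `gap_ratio` and `line_tol` thresholds are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # the character drawn on the page
    x: float      # left edge of the character
    y: float      # baseline height (larger y = higher on the page)
    width: float  # horizontal space the character occupies


def glyphs_to_lines(glyphs, gap_ratio=0.3, line_tol=2.0):
    """Guess lines and word breaks from character positions alone."""
    # Group characters into rows: similar baselines are assumed to be one line.
    rows = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if rows and abs(rows[-1][0].y - g.y) <= line_tol:
            rows[-1].append(g)
        else:
            rows.append([g])

    lines = []
    for row in rows:
        row.sort(key=lambda g: g.x)  # read each row left to right
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap noticeably wider than usual is assumed to be a space.
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        lines.append(text)
    return lines


def lay_out(text, y, width=5.0):
    """Hypothetical helper: place characters on a grid, storing no spaces."""
    return [Glyph(c, i * width, y, width) for i, c in enumerate(text) if c != " "]


# The glyphs arrive in arbitrary order, and no space characters are stored.
page = lay_out("ok", y=88.0) + lay_out("Hi there", y=100.0)
print(glyphs_to_lines(page))  # ['Hi there', 'ok']
```

The thresholds are where things go wrong in practice: tight kerning, justified text, or multi-column layouts can all make a fixed gap rule split or merge words incorrectly, which is why results vary between tools and between PDFs.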

The standard solution to your kind of problem is to use Adobe Acrobat Professional (the expensive one, not the free reader) to convert the PDF to HTML. Even that is not going to get perfect results.

There is free software that can be used to extract text from PDFs with some of the formatting intact, but again, don’t expect perfect results. See, e.g., calibre (which can convert to RTF format), pdftohtml/pdfreflow, or the AbiWord word processor (with all import/export plugins enabled). There’s also a PDF import plugin for OpenOffice.

But please don’t expect perfection with any of these results. You’re going against the grain here. PDF just is not meant as an editable input format.

If you are having trouble deciding which tool to start with, Calibre is a veritable document Swiss Army knife. You can also use it to convert PDF files for use on your ebook reader and organize your ebook/document library.
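Whichever tool does the extraction, some of the damage Colen describes (hard line breaks inside paragraphs, words left hyphenated across lines, and curly quotes) can be partly repaired after pasting. The Python sketch below is one rough way to do that, assuming paragraphs in the pasted text are separated by blank lines; the function name `clean_copied_text` is made up for this example. It cannot restore bold or italics, and it cannot recover quotes that have already been replaced with ? signs.

```python
import re

# Map curly ("smart") quotes to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def clean_copied_text(raw: str) -> str:
    """Heuristically tidy text pasted from a PDF."""
    text = raw.replace("\r\n", "\n").translate(SMART_QUOTES)

    # Rejoin words split across lines: "format-\nting" -> "formatting".
    # Caveat: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Blank lines are assumed to mark paragraphs; any other line break
    # is treated as a soft break and replaced with a single space.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


sample = (
    "PDF is a \u201cfixed\u201d for-\nmat that keeps\nits layout.\n\n"
    "Second para-\ngraph."
)
print(clean_copied_text(sample))
```

Because the rules are guesses, it is worth reading over the cleaned text afterwards rather than trusting it blindly.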


Have something to add to the explanation? Sound off in the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.

Jason Fitzpatrick is a warranty-voiding DIYer and all-around geek. When he's not documenting mods and hacks he's doing his best to make sure a generation of college students graduate knowing they should put their pants on one leg at a time and go on to greatness, just like Bruce Dickinson.

  • Published 02/7/13

Comments (25)

  1. michel

    Doesn’t the newly released Office 2013 open PDFs for editing? Open, select all, copy, paste.

  2. GeekinTexas

    Is there a quick and easy way for Colen (and the rest of us) to grab text without sacrificing the formatting?

    So basically, this long article answered the question and said:

    No.

  3. sidharth

    Great find. I was actually looking for a solution to copy text from PDFs. But as you said, I had experienced formatting issues. This fixed the problem. Thanks.

  4. Jim Carter

    If you’re going to pay for a PDF tool, there’s nothing better than Nuance PDF Converter Professional 8. I own a PC repair/sales business and rely on this application daily.

  5. StevenTorrey

    Many authors take pride in their authorship. PDF to a certain extent protects that production.

  6. Bob

    thanks Jim Carter, great input!

  7. ron

    Yes, Word 2013 does read some PDFs. As you say, it does the equivalent of a select all, copy, and paste for embedded text, as well as capturing images. So if the PDF is set up to place all of the text in a single large image, Word 2013 won’t help.

    In that case you can use OneNote 2007 / 2010 / 2013. It can perform OCR, Optical Character Recognition, to extract text from image files.

  8. Little John

    I use several different programs to copy text from PDF files. The best is Adobe Acrobat, but its price is high. Another I use is Nuance OmniPage, which converts complete PDFs to Word files. True, the new Office 2013 does read PDF files, but not all of them: for example, it cannot read a PDF created from JPG files. If the PDF was created using a text editor, then Word has no problem opening the file.

    Another way to create a text file is to read the text into Dragon NaturallySpeaking, but you will lose the formatting.

  9. John Keegan

    I also have to copy information from PDF files. I paste the information into Notepad to remove the formatting, and then into the document I’m working on. Obvious I know, but it works.

  10. Jack

    I cannot even follow the explanation.

  11. Tracie

    The new Adobe Acrobat XI will allow you to copy and paste while retaining the formatting. Enough said.

  12. Rafael

    Redo the article when I can “expect perfect results”. Partial results remind me of the series “Get Smart”: “missed it by that much.”

  13. Scott

    I like Sumatra PDF (it’s free):

    ‘What is Sumatra PDF?

    Sumatra PDF is a free PDF, eBook (ePub, Mobi), XPS, DjVu, CHM, Comic Book (CBZ and CBR) reader for Windows.

    Sumatra PDF is small, portable and starts up very fast.

    Simplicity of the user interface has a high priority.’

    http://blog.kowalczyk.info/software/sumatrapdf/free-pdf-reader.html

  14. mikmik

    What Tracie said.

  15. Stuck in Kentuck

    @Jim Carter
    Nuance PDF is well worth the money. I use it every day.
    Tip: Nuance is very cooperative when told not to telephone offering “deals” on other Nuance products. The calls do stop.

  16. Joshua

    I found OneNote difficult to use. I tried it specifically for OCR. Then I remembered I had access to Adobe CS5 with Acrobat. OCR has worked perfectly so far and it couldn’t be simpler.

  17. Bob_WA

    Thank you Frabjous for the explanation. So if I understand it correctly, when a pdf reader (I use Nitro) does a “select text” operation, it is not copying a line of text from the document source but actually doing a quick OCR operation on what appears on the display?

  18. Bernard

    As Tracie wrote, use Adobe Acrobat XI. It allows almost perfect copy and paste. It just seems to lose blank lines.

  19. Dic

    Don’t know what Colen wants to do with it, but of course for basic purposes a screenshot would grab selected text. (Many people new to computing, or sometimes not, still don’t know what the PrtScn – print screen – keyboard key does.)

  20. Chris

    Able2Doc Professional does a very good job of converting to an MS Word doc most of the time. If you want to convert PDF drawings into a DXF file to edit with AutoCAD, Aide PDF to DXF Converter will do the job; it’s not perfect, though, and can’t convert directly to a DXF with metric dimensions.

  21. Loren

    I use CorelDraw in my business, and most PDFs can be opened in it, but as noted nothing is perfect. I suspect that users with other Adobe programs can open them too. CorelDraw is especially useful for opening PDFs with graphics. But of course if you don’t need CorelDraw for other purposes, it is just as expensive as Acrobat.

  22. john3347

    “How Can I Copy Text from a PDF while Preserving the Formatting?”

    So really this whole article can be summed up in two words: “You can’t”. Perhaps you can copy the information contained in a text, but you will not accurately and reliably save formatting.

  23. joe

    @john3347: that’s what I was thinking. The entire article could be summed up within the headline.

  24. Peter Ridgers

    Actually the article is more than the headline and does make an attempt to explain what easily can and cannot be converted from PDF to other document formats.

    At the end of the day the best way to get original text from a document, complete with formatting, is to take it from the original file, not the derived PDF. Again, the very best way to modify a PDF is to modify (a copy of) the original and regenerate the PDF.

    If you don’t have access to the original that’s your fault not the fault of the writers of the PDF specification.

  25. DarkWinterNights

    On a side note, Sony ereaders already include software for reflowing PDF text into coherent paragraphs. It strikes me it wouldn’t be a huge jump for them to reuse the API.
