Linux Fu: PDF for Penguins

PostScript started out as a programming language for printers. While PostScript printers are still a thing, there are many other ways to send data to a printer. But PostScript also spawned the Portable Document Format or PDF and that has been crazy successful. Hardly a day goes by that you don’t see some kind of PDF document come across your computer screen. Sure, there are other competing formats but they hold a sliver of market share compared to PDF. Viewing PDFs under Linux is no problem. But what about editing them? Turns out, that’s easy, too, if you know how.

GUI Tools

You can use lots of tools to edit PDF files, but the trick is how good the results will look. Almost any drawing or image program will open one: LibreOffice Draw, Inkscape, or even GIMP. If all you want to do is cover something with a white box or add an annotation, these tools are usually great. For more complicated changes or pixel-perfect output, they may not be the right tool.

The biggest problem is that most of these tools deal with the PDF as an image or, at least, a collection of objects. For example, columns of text will probably turn into a collection of discrete lines. Changing something that causes a line to wrap will require you to change all the other lines to match. Sometimes text isn’t even text at all, but images. It largely depends on how the creator made the PDF to begin with.

If you don’t mind using a Web-based tool, PDFEscape is free and works very well. Other options include Scribus and Okular. Neither tool can really edit the file, but both can import pages as images that you can further manipulate. For example, Okular’s review mode can add annotations like highlights and freehand lines.

Unsurprisingly, emacs can display a PDF file if it is running under X. You can use Control+C Control+C to switch to view a text representation. After all, most of the PDF file format is text and emacs can even handle binary files. So if you don’t mind working inside the PDF format — very much like PostScript — you can do your editing in emacs or even another text editor.

There are a few dedicated non-free editors out there and at least one open-source PDF-specific editor. Of course, like most things in Linux, you can also use the command line.

Hiding Text

The problem with working with PDFs as text — even in emacs — is that they are often compressed and otherwise unreadable. For example, words may appear a character at a time separated by formatting code or other data. So searching for Hackaday in the PDF may not work.

You can convert the file so that more of it is uncompressed text, although that’s no panacea. For example, if you open up this segment from an article on ham radio and want to change the word “convention”, it is hard to tell exactly where that text is, but it is somewhere in this general area:

3 0 obj << /Length 14770 /Filter /FlateDecode >> stream
H�|WÉ’�8��+p$gJ,�c��v�cS�ÒŒc��J�$���\ZV�����\0�� �CTR�������r��[�}�7}����|��������I5u���`M�>�/��?l�.8�@��gBzq�r!#�%� AE�� �˜ ᥉��x!$��X8^%$��A�D�B���(���b�[H �>����#��{a���e0$^H&|/����U1$^��#��/�G�Us��/"/��\ <i�'qC���$xe�"X�x22�������G��F�Lp]Mnm�$] #TI��G�q�l��'3;!���!+�È·�{ä•€���
��b��Qja����Q i� GRn�\0�g;L����x�Zܿ㌳�n�2�R& :"x�r�ky�[JPK��/���S��i��������]r�F�p����k�� |���
QI�mx>1�\�1�Q��y)ХǺ�Z�U.^�](pN��dx����;�֬;d�_�{˪�cYa�\�.t�s�}�ْ{<\0ZW�:�È„�OÉ´��cS�UzluP�֨o}ި��Uqf��o��V��bT%mj|��t����;v�{s�Rj˺���

Good luck finding it in that soup. You want to convert it to unpacked text.

qpdf --qdf input.pdf output.txt

The resulting file is still a PDF, even though I named it .txt, but everything in it is unpacked. That’s still not great, but at least you can find the part you need to change:

1.2632 -1.1242 TD
0.0739 Tc
0.1263 Tw
(One potentially confusing Stamp)Tj
-1.2632 -1.1368 TD
0.026 Tc
0.1248 Tw
[(con)38.6(v)20.7(ention is that the I/O pin numbers)]TJ
0 -1.1242 TD
0.0262 Tc
0.0072 Tw
[<646f6e90>13.6(t correspond to the IC pin numbers.)]TJ
T*

Again, good luck searching for the word “convention”: it is split into the fragments (con), (v), and (ention, with kerning adjustments in between. But it is still better than the first example. You can also find metadata even in unprocessed files by searching for keys like /Author and /Title.
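
One way to make the unpacked stream a little more searchable is to strip the kerning numbers that sit between string fragments. Here is a minimal sketch using sed. The name flatten_tj is made up for this example, and the pattern assumes simple literal strings like the ones above. It will not decode hex strings such as <646f6e90> or cope with escaped parentheses.

```shell
#!/bin/sh
# flatten_tj: hypothetical helper, not part of qpdf.
# Reads an unpacked content stream on stdin and removes the kerning
# numbers between adjacent string fragments inside TJ arrays, so
# "(con)38.6(v)20.7(ention" becomes "(convention".
flatten_tj() {
    sed -E 's/\) *-?[0-9]*\.?[0-9]+ *\(//g'
}

# Example (assumes output.txt came from qpdf's QDF mode):
#   flatten_tj < output.txt | grep -n 'convention'
```

Treat the result as a search aid only. The kerning is gone, so you would still make the real edit in the original unpacked file.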

Command Line Magic

The qpdf tool converts a PDF file into another PDF file. It can optimize the output for Web serving or text editing, and it can do simple things like remove pages or merge pieces of multiple files. You can read the documentation, but here we use the QDF mode to produce a legitimate PDF file with all the objects in numerical order and with normal Unix-style line endings. This makes the file easier to edit with a text editor, although, as you’ve seen, that doesn’t always make it simple. Removing entire objects is a headache, but if you get rid of all the mentions of an object, you can run fix-qdf to recreate the proper QDF file.
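
Put together, the edit cycle might look something like the sketch below. The function names pdf_unpack and pdf_repack are invented for illustration. The commands inside them are the ones described above, with fix-qdf writing its repaired file to standard output.

```shell
#!/bin/sh
# pdf_unpack and pdf_repack are hypothetical helper names.

# Unpack a PDF into an editable QDF file.
#   $1 = input PDF, $2 = QDF file to create
pdf_unpack() {
    qpdf --qdf "$1" "$2"
}

# After hand-editing, let fix-qdf repair the stream lengths and
# cross-reference data, writing the result to a new file.
#   $1 = edited QDF file, $2 = repaired PDF to create
pdf_repack() {
    fix-qdf "$1" > "$2"
}

# Typical session:
#   pdf_unpack input.pdf editable.pdf
#   "${EDITOR:-vi}" editable.pdf
#   pdf_repack editable.pdf fixed.pdf
```

Your editor has to leave binary data alone, since an unpacked file can still contain things like embedded images.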

Another way to make common edits to PDF files is to use PDFtk server (PDFtk without the server moniker is a GUI toolkit for Windows). Using PDFtk you can merge or split documents, rotate pages, and do many other common tasks. For example, to join two files in order:

pdftk in1.pdf in2.pdf cat output output.pdf

You can omit, say, page 9:

pdftk in1.pdf in2.pdf cat 1-8 10-end output output.pdf

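You can also interleave the pages of two files with shuffle, which takes one page at a time from each input in turn. That is handy for merging separately scanned odd and even pages: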

pdftk A=in1.pdf B=in2.pdf shuffle A B output output.pdf
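
Splitting and rotating work along the same lines. This is a rough sketch rather than a recipe: pdf_split and pdf_rotate_cw are made-up wrapper names, the page_%02d.pdf naming pattern is just an example, and the east rotation keyword assumes a reasonably recent pdftk (older releases used single letters such as E).

```shell
#!/bin/sh
# pdf_split and pdf_rotate_cw are hypothetical helper names.

# Split a document into one file per page: page_01.pdf, page_02.pdf, ...
#   $1 = input PDF
pdf_split() {
    pdftk "$1" burst output page_%02d.pdf
}

# Rotate every page 90 degrees clockwise.
#   $1 = input PDF, $2 = output PDF
pdf_rotate_cw() {
    pdftk "$1" cat 1-endeast output "$2"
}

# Example:
#   pdf_split scan.pdf
#   pdf_rotate_cw scan.pdf upright.pdf
```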

Text to PDF and Back

If you want to convert text into PDF from the command line you have several options. Pandoc is an amazing tool that converts markdown to almost anything, PDF included. For PDF output it relies on an external engine, which is LaTeX by default.

You can also use various combinations of ps2pdf (along with a tool to generate PostScript), pdftotext (part of poppler-utils), or Ghostscript to create PDFs or strip text out of them. Ghostscript can do a lot, including converting a PDF to a number of image formats if you want to, say, display the pages on a Web page as images.
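
As a starting point, the conversions mentioned here could be wrapped up as follows. The function names are invented for this sketch, and the particular choices (the -layout flag, the png16m device, 150 dpi, the numbered output pattern) are only one reasonable set of assumptions.

```shell
#!/bin/sh
# md_to_pdf, pdf_to_text and pdf_to_png are hypothetical helper names.

# Markdown to PDF. Pandoc needs a PDF engine such as pdflatex installed.
#   $1 = markdown file, $2 = output PDF
md_to_pdf() {
    pandoc "$1" -o "$2"
}

# Pull the text out of a PDF, trying to preserve the page layout.
#   $1 = input PDF, $2 = output text file
pdf_to_text() {
    pdftotext -layout "$1" "$2"
}

# Render each page as a 24-bit PNG at 150 dpi: prefix-001.png, ...
#   $1 = input PDF, $2 = output file prefix
pdf_to_png() {
    gs -q -sDEVICE=png16m -r150 -o "$2-%03d.png" "$1"
}

# Example:
#   md_to_pdf notes.md notes.pdf
#   pdf_to_text notes.pdf notes.txt
#   pdf_to_png notes.pdf page
```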

Special Printing and Other Tools

Sometimes you want to modify a PDF file so it will print a certain way. We’ve already talked about how to merge odd and even pages, for example, but there are a few other commands you might want for this purpose:

  • pdfxup – Uses pdflatex and Ghostscript to put multiple pages on one printed page (e.g., 2-up)
  • pdfjam – Uses LaTeX to put documents on different size pages or produce multiple pages on one printed page
  • pdfposter – Create giant output on multiple pages from a single page
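
For instance, a 2-up layout with pdfjam might look like the sketch below. The wrapper name two_up is made up, and the landscape orientation is an assumption about what you want on the sheet.

```shell
#!/bin/sh
# two_up: hypothetical helper name.
# Put two pages side by side on each landscape sheet.
#   $1 = input PDF, $2 = output PDF
two_up() {
    pdfjam --nup 2x1 --landscape --outfile "$2" "$1"
}

# Example:
#   two_up manual.pdf manual-2up.pdf
```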

If you prefer a GUI, you might check out PDFsam Basic. If you are interested in Java software, there is Multivalent.

Wrap Up

As usual, there are many ways to do daily tasks in Linux. Sometimes the challenge isn’t doing the work, but rather finding the tool that best fits your style of working.

Oddly enough, pandoc keeps coming up for different reasons. If you prefer your documents on paper, you need a printer and a bookbinding clamp.

 
