Tabula System
Tabula System is a free and open-source PDF table extraction system.
- Context:
- It can be developed and maintained by a team led by Manuel Aristarán, Mike Tigas, and Jeremy B. Merrill, with support from ProPublica, La Nación DATA, Knight-Mozilla OpenNews, and The New York Times.
- It can be used by journalists, researchers, and grassroots organizations, enabling them to convert tables found in PDF files into CSV or Microsoft Excel spreadsheets, making the data easier to manipulate and analyze.
- It can work on Mac, Windows, and Linux operating systems.
- It can be funded in part through user donations and grants from the Knight Foundation and the Shuttleworth Foundation.
- ...
- Example(s):
- Tabula, v1.2.1 (2018-04) [1].
- ...
- Counter-Example(s):
- Adobe Acrobat, which can export to Word format or HTML format.
- A PDF table extraction system that can process a scanned image of a table within a PDF.
- ...
- See: Table Representation Standard, Data Extraction.
References
2018
- https://github.com/tabulapdf/tabula
- QUOTE: If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is — you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
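As a rough illustration of the CSV output the quote describes, the following minimal sketch assumes the table rows have already been recovered from a text-based PDF as lists of cell strings. It is not Tabula's own code: the function name `rows_to_csv` and the sample rows are hypothetical, and only Python's standard `csv` module is used.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows (lists of cell strings) as CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows standing in for cells recovered from a PDF table.
sample_rows = [
    ["Region", "Population"],
    ["North, coastal", "1200"],
    ["South", "950"],
]

csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

A cell that contains a comma, such as "North, coastal", is quoted by the writer, so the file still parses back into the same rows when opened in a spreadsheet or read with `csv.reader`.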