Our client is a large hedge fund based in New York City, USA. The company processes a massive number of documents every day to analyze market trends and improve the accuracy of management decisions. To speed up this part of its analytics, the client wanted a tool that would help gather data from PDF documents.
The project’s goal was to develop a backend tool that recognizes tables in PDF files so that the data they contain can be processed automatically. The tool would convert each PDF into CSV or another format that our client’s analytics tools can parse to generate insights from the available data.
PDF is one of the most popular formats for reports today because it displays consistently across different applications. One of our experienced software engineers was assigned to help the company boost its analytics capabilities and stay ahead of the competition. Our developer analyzed the problem and delivered a proof of concept (POC) to be added to our client’s analytics solution.
The solution developed by our engineer consists of two components. The first converts PDF files into a binary format that can be processed by a backend programming language. The second applies a range of criteria to identify and parse tables. Our developer also equipped the solution with machine learning capabilities for handling complex cases, such as reports that contain multiple images and tables.
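To give a rough idea of what the second component does, here is a minimal sketch in Python. It is not the code delivered to the client, and the criteria it uses are assumptions chosen for illustration: text fragments that share roughly the same vertical position are treated as one row, and fragments that start at roughly the same horizontal position are treated as one column. The `Fragment` class, the function names, and the tolerance values are all hypothetical, and the sketch assumes the first component has already extracted each piece of text together with its position on the page.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text and the page position where it starts (hypothetical model)."""
    text: str
    x: float
    y: float


def group_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows when their y positions differ by at most y_tolerance."""
    rows = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if rows and abs(rows[-1][0].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Order the fragments of each row from left to right.
    return [sorted(row, key=lambda f: f.x) for row in rows]


def column_starts(rows, x_tolerance=5.0):
    """Cluster the x positions of all fragments into a list of column start positions."""
    starts = []
    for x in sorted(f.x for row in rows for f in row):
        if not starts or x - starts[-1] > x_tolerance:
            starts.append(x)
    return starts


def to_table(fragments, y_tolerance=2.0, x_tolerance=5.0):
    """Build a rectangular table (list of lists) from positioned text fragments."""
    rows = group_rows(fragments, y_tolerance)
    starts = column_starts(rows, x_tolerance)
    table = []
    for row in rows:
        cells = [""] * len(starts)
        for frag in row:
            # Assign the fragment to the column whose start position is nearest.
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - frag.x))
            cells[col] = (cells[col] + " " + frag.text).strip()
        table.append(cells)
    return table


def to_csv(table):
    """Serialize the table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()
```

Calling `to_csv(to_table(fragments))` on the fragments of a simple two-column report yields one CSV line per detected row, with empty cells where a row has no value in a column. Real reports need many more criteria than alignment alone, which is where the machine learning part of the solution comes in.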