I am doing some data scraping. There are three types of files I need to scrape data from:
1- HTML
2- PDF
3- Excel (xls)
For HTML I am comfortable; I am using the HTML Agility Pack for that.
For PDF and Excel I would appreciate suggestions.
Concerning Excel: if you are in a Microsoft environment, you can either use Office Automation or query the workbook through OLEDB. In a Java
environment, look at Apache POI.
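For the OLEDB route mentioned above, a connection string for a legacy .xls workbook usually looks roughly like the sketch below. The file path and the sheet name Sheet1 are placeholders (assumptions, not from the question), and the Jet 4.0 provider is only available to 32-bit processes, so treat this as a starting point rather than a definitive setup.

```text
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\report.xls;Extended Properties="Excel 8.0;HDR=YES;IMEX=1";

SELECT * FROM [Sheet1$]
```

HDR=YES treats the first row as column headers, and IMEX=1 asks the driver to read mixed-type columns as text. The worksheet is addressed as a table by its name followed by a dollar sign.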
EDIT: Concerning PDF, in Java try Apache PDFBox. It can also be used from .NET via IKVM.
I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF
files into Excel. We have used it with great success.
HTML Agility Pack is a library, and it's good to use. But why do you need separate tools for different data extraction
purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all three of the
sources you have listed. Google it.
Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel