I'm trying to scrape some data off of the FEC.gov website using Python for a project of mine. Normally I use
mechanize and BeautifulSoup to do the scraping.
I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is
streamed into the table and mechanize.Browser() just stops listening.
So here's the issue: if you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A you get the first 500
contributors whose last names start with A and who have given money to candidate P80003338. However, if you use
browser.open() at that URL, all you get is the first ~5 rows.
I'm guessing it's because mechanize isn't letting the page fully load before .read() is executed. I tried putting a
time.sleep(10) between the .open() and the .read(), but that didn't make much difference.
And I checked: there's no JavaScript or AJAX on the site (or at least none is visible in 'view-source'), so I don't
think it's a JavaScript issue.
Any thoughts or suggestions? I could use Selenium or something similar, but that's something I'm trying to avoid.
-Will
2 Answers
Why not use an HTML parser like lxml with XPath expressions?
I tried
>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>
Similarly, you can create XPath expressions of your choice to work with.
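If the live URL is unavailable, the same technique can be tried against a static snippet with lxml.html.fromstring. The markup below is a simplified stand-in for the FEC results page, not its actual structure (and note that lxml's HTML parser, unlike a browser, does not insert tbody elements, so tr sits directly under table):

```python
import lxml.html as lh

# Simplified stand-in for the FEC page: a header table followed by
# the results table, with contributor names in the first column.
html = """
<html><body>
<table><tr><td>page header</td></tr></table>
<table>
  <tr><td><a href="#"> AABY, TRYGVE</a></td><td>250</td></tr>
  <tr><td><a href="#"> AHMED, ASHFAQ</a></td><td>500</td></tr>
</table>
</body></html>
"""

doc = lh.fromstring(html)

# Text of every link in the first column of the second table,
# mirroring the //table[2]/.../td[1]/a/text() expression above.
names = doc.xpath('//table[2]/tr/td[1]/a/text()')
print(names)
```

The only difference from the transcript is fromstring() instead of parse(): parse() takes a filename or URL, while fromstring() takes the HTML itself, which is handy if you fetch the page separately (e.g. with urllib) or test offline.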
Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-fashion