[NTLUG:Discuss] Scriptable, javascript-aware web browser OR virtual operator
Carl Haddick
sysmail at glade.net
Thu Aug 30 09:30:02 CDT 2007
> Leroy Tennison wrote:
>
> > I need to "screen scrape" a generated web page which is generated by
> > filling in a form on a previous page. I haven't had any success finding
> > a solution yet so I'm hoping someone here can help.
>
I never contribute like I should. Please let me know if a script
example like this is OK as an occasional offering.
The problem got me interested in re-learning http connections in Python.
First I went to www.timezoneconverter.com/cgi-bin/tzc.tzc with Firefox.
That web page has a form which lets you convert times from one time zone
representation to another.
I chose to convert Jamaica time, mon, to GMT, but before I hit the
'convert time now' button I turned on tcpdump in a root shell to capture
the http dialog:
tcpdump -n -s 0 -w http.dump tcp port 80
Then, I hit the convert time button. In the root shell window, I
control-c'd the tcpdump process and chowned http.dump to my own user.
Opening up http.dump in ethereal (new versions are named Wireshark), I
found the url encoded 'post' values in the first line after the
headers.
I also just copied the headers from ethereal. I used the 'follow tcp
stream' under the 'analyze' pull-down menu.
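The captured POST body is just a url encoded string, so it can be turned back into name/value pairs mechanically instead of being retyped. A minimal sketch, assuming Python 3 (where urlencode lives in urllib.parse) and a body string reconstructed by hand from the form values, not pasted from a real capture:

```python
from urllib.parse import parse_qsl, urlencode

# Illustrative body, shaped like the first line after the request headers
# in a 'follow tcp stream' window. Not copied from an actual capture.
captured_body = (
    "style=1&use_current_datetime=1&month=8&day=30&year=2007"
    "&time=13%3A34%3A39&time_type=24hour&fromzone=Jamaica&tozone=GMT"
    "&Submit.x=92&Submit.y=12&Submit=Convert"
)

# parse_qsl decodes the percent-escapes and keeps the fields in order.
fields = dict(parse_qsl(captured_body, keep_blank_values=True))

# Re-encoding the dict gives a body that can be sent in a POST request.
postdata = urlencode(fields)
```

Once the values are in a dict it is easy to change one of them, say 'tozone', before re-encoding.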
Using tcpdump saved me from looking through html for the form and all
the input fields. Easier to copy than create.
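For comparison, the "create" route would mean reading the form out of the page's html. A rough sketch of that, assuming Python 3's html.parser; the FormFieldLister class and the sample form are made up for illustration and are not the real page's markup:

```python
from html.parser import HTMLParser

class FormFieldLister(HTMLParser):
    """Collect (name, value) pairs for named form controls (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            attrs = dict(attrs)
            if "name" in attrs:
                # Controls without a value attribute get an empty string.
                self.fields.append((attrs["name"], attrs.get("value") or ""))

# Invented form fragment, only loosely modeled on the fields in this post.
sample_form = """
<form method="post" action="/cgi-bin/tzc.tzc">
  <input type="hidden" name="style" value="1">
  <select name="fromzone"><option>Jamaica</option></select>
  <input type="image" name="Submit" value="Convert">
</form>
"""

lister = FormFieldLister()
lister.feed(sample_form)
field_names = [name for name, _ in lister.fields]
```

An image-type submit button is what makes a browser send the extra 'Submit.x' and 'Submit.y' click coordinates, which is one more detail a capture hands you for free.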
The headers and the url encoded values I plugged into a short Python
script. The work is done by the calls to httplib.HTTPConnection,
tzconvert.request, tzconvert.getresponse, and a 'search' using a
compiled regular expression.
When I run this script from a command line it prints the current time in
GMT, taking Jamaica as the local zone. Time zone issues aside, it does
this by surfing a web site and pulling information out of a returned web
page.
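The "pulling information out" step is a single regular expression search. Here is a small sketch of it run against a made-up fragment; the pattern is the one from the script, while the sample html and the extract_time helper are assumptions for illustration, since the real page layout may differ:

```python
import re

# Same pattern as the script: a bold HH:MM:SS followed by letters,
# digits, commas, and whitespace (such as a day name and a date).
TZ_RE = re.compile(
    r'<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>', re.M | re.I
)

def extract_time(html):
    """Return the first bold time string in html, or None if there is none."""
    match = TZ_RE.search(html)
    return match.group(1) if match else None

# Invented result-page fragment, not captured from the site.
sample_page = (
    "<tr><td><b>Time Zone Converter</b></td>"
    "<td><b>18:34:39 Thursday, August 30, 2007</b></td></tr>"
)

converted = extract_time(sample_page)
missing = extract_time("<b>no time here</b>")
```

The first bold element is skipped because it does not start with a time, which is why the pattern insists on the digits and colons.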
If I were smarter I could no doubt think of far better ways, but, as Ma
always said, it sucks to be me.
Leroy, if you need help, email me. If the solution is complex I would
not be available on an unpaid basis, but on the other hand I've lurked
here for years. Might not hurt to contribute something back.
Script follows.
Regards,
Carl
#!/usr/local/bin/python
# Python 2 script (httplib, urllib.urlencode, print statement).
import httplib,urllib,re

# Matches the converted time, <b>HH:MM:SS ...</b>, in the returned page.
tzres=re.compile(r'<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>',re.M|re.I)

# Form fields, copied from the captured POST body.
postdata=urllib.urlencode({
    'style':'1',
    'use_current_datetime':'1',
    'month':'8',
    'day':'30',
    'year':'2007',
    'time':'13:34:39',
    'time_type':'24hour',
    'fromzone':'Jamaica',
    'tozone':'GMT',
    'Submit.x':'92',
    'Submit.y':'12',
    'Submit':'Convert'
    })

# Request headers, copied from the captured Firefox request.
# Note: httplib does not decompress responses. If the body ever comes
# back gzipped, drop the Accept-Encoding header.
headers={
    'Host':'www.timezoneconverter.com',
    'User-Agent':'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051111 Firefox/1.5',
    'Content-type':'application/x-www-form-urlencoded',
    'Accept':'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
    'Accept-Language':'en-us,en;q=0.5',
    'Accept-Encoding':'gzip,deflate',
    'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive':'300',
    'Connection':'keep-alive',
    'Referer':'http://www.timezoneconverter.com/cgi-bin/tzc.tzc',
    }

# Send the POST and read back the generated page.
tzconvert=httplib.HTTPConnection('www.timezoneconverter.com')
tzconvert.request('POST','/cgi-bin/tzc.tzc',postdata,headers)
verthtml=tzconvert.getresponse()

if verthtml.status==200 and verthtml.reason=='OK':
    # Pull the converted time out of the html.
    srch=tzres.search(verthtml.read())
    if srch:
        print 'Right now in Jamaica, mon, it\'s %s GMT'%srch.group(1)
    else:
        print 'Conversion not found, mon, call it \'island time\'.'
else:
    print 'Bad ju-ju in the islands, mon. Bad HTML request.'
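
For anyone on Python 3, where httplib became http.client, the same request might look roughly like the sketch below. It also handles one thing the headers invite: asking for gzip or deflate means the body may come back compressed, and http.client does not undo that for you. The decode_body and fetch_conversion names are hypothetical, and whether the site still answers this way is an assumption.

```python
import gzip
import http.client
import zlib
from urllib.parse import urlencode

def decode_body(raw, content_encoding):
    """Undo gzip or zlib-wrapped deflate if the server applied it."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        # Assumes zlib-wrapped data; some servers send raw deflate instead.
        return zlib.decompress(raw)
    return raw

def fetch_conversion(fields, host="www.timezoneconverter.com",
                     path="/cgi-bin/tzc.tzc"):
    """POST the form fields; return the page body as bytes, or None."""
    conn = http.client.HTTPConnection(host, timeout=30)
    try:
        conn.request("POST", path, urlencode(fields), {
            "Content-type": "application/x-www-form-urlencoded",
            "Accept-Encoding": "gzip,deflate",
        })
        resp = conn.getresponse()
        if resp.status != 200:
            return None
        return decode_body(resp.read(), resp.getheader("Content-Encoding"))
    finally:
        conn.close()
```

The body comes back as bytes, so it would need a .decode() (or a bytes pattern) before being handed to the regular expression search.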