[NTLUG:Discuss] Scriptable, javascript-aware web browser OR virtual operator

Carl Haddick sysmail at glade.net
Thu Aug 30 09:30:02 CDT 2007


> Leroy Tennison wrote:
> 
>  > I need to "screen scrape" a generated web page which is generated by
>  > filling in a form on a previous page. I haven't had any success finding
>  > a solution yet so I'm hoping someone here can help.
> 

I never contribute like I should.  Please let me know if a script
example like this is OK as an occasional offering.

The problem got me interested in re-learning http connections in Python.

First I went to www.timezoneconverter.com/cgi-bin/tzc.tzc with Firefox.
That web page has a form which lets you convert times from one time zone
representation to another.

I chose to convert Jamaica time, mon, to GMT, but before I hit the
'convert time now' button I turned on tcpdump in a root shell to capture
the http dialog:

    tcpdump -n -s 0 -w http.dump tcp port 80

Then I hit the convert time button.  In the root shell window, I
control-c'd the tcpdump process and chowned http.dump to my own user.

Opening up http.dump in ethereal (newer versions are named Wireshark),
I found the url encoded 'post' values in the request body, which is the
first line after the blank line that ends the headers.

I also just copied the headers from ethereal.  I used the 'follow tcp
stream' under the 'analyze' pull-down menu.
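
For anyone who wants to sanity-check what was captured before pasting it
into a script, here is a minimal sketch (Python 3) that splits a copied
request into headers and form fields.  The request text below is a
shortened, made-up sample and split_request is just an illustrative
helper name, not anything from the capture itself.

```python
from urllib.parse import parse_qsl

# Hypothetical, shortened request text as copied from a "follow tcp stream" view.
RAW_REQUEST = (
    "POST /cgi-bin/tzc.tzc HTTP/1.1\r\n"
    "Host: www.timezoneconverter.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "style=1&fromzone=Jamaica&tozone=GMT&Submit=Convert"
)


def split_request(raw):
    """Return (request line, headers dict, form fields dict) from raw request text."""
    # The blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # The body is url encoded: key=value pairs joined by '&'.
    fields = dict(parse_qsl(body, keep_blank_values=True))
    return request_line, headers, fields


request_line, headers, fields = split_request(RAW_REQUEST)
print(request_line)
print(headers)
print(fields)
```

The fields dict is what goes into urlencode, and the headers dict is what
goes into the request call.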

Using tcpdump saved me from looking through html for the form and all
the input fields.  Easier to copy than create.

The headers and the url encoded values I plugged into a short Python
script.  The work is done by the calls to httplib.HTTPConnection,
tzconvert.request, tzconvert.getresponse, and a 'search' using a
compiled regular expression.

When I run this script from a command line it prints the current time
converted to GMT, treating local time as Jamaica time.  Time zone issues
aside, it does this by surfing a web site and pulling information out of
a returned web page.
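
The 'pulling information out' part can be tried without touching the
network.  This is a small sketch (Python 3) that runs the same regular
expression against a made-up HTML fragment; the sample markup and the
extract_time helper are assumptions for illustration, since the real page
layout may differ.

```python
import re

# Same pattern as the script: a time followed by date text inside <b>...</b>.
TZRES = re.compile(r'<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>', re.M | re.I)

# Hypothetical fragment standing in for the returned page.
SAMPLE_HTML = '<p>Converted time: <b>18:34:39 Thursday, August 30, 2007</b> GMT</p>'


def extract_time(html):
    """Return the first bolded time string in html, or None if there is none."""
    match = TZRES.search(html)
    return match.group(1) if match else None


print(extract_time(SAMPLE_HTML))
print(extract_time('<b>no time here</b>'))
```

If the site ever changes its markup, only the pattern needs to change.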

If I were smarter I could no doubt think of far better ways, but, as Ma
always said, it sucks to be me.

Leroy, if you need help, email me.  If the solution is complex I would
not be available on an unpaid basis, but on the other hand I've lurked
here for years.  Might not hurt to contribute something back.

Script follows.

Regards,

Carl

#!/usr/local/bin/python

import httplib,urllib,re

# raw string so the regex engine, not Python, interprets \s
tzres=re.compile(r'<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>',re.M|re.I)

postdata=urllib.urlencode({
	'style':'1',
	'use_current_datetime':'1',
	'month':'8',
	'day':'30',
	'year':'2007',
	'time':'13:34:39',
	'time_type':'24hour',
	'fromzone':'Jamaica',
	'tozone':'GMT',
	'Submit.x':'92',
	'Submit.y':'12',
	'Submit':'Convert'
	})

headers={
    'Host':'www.timezoneconverter.com',
    'User-Agent':'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051111 Firefox/1.5',
    'Content-type':'application/x-www-form-urlencoded',
    'Accept':'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
    'Accept-Language':'en-us,en;q=0.5',
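    # 'Accept-Encoding' left out: httplib does not decompress gzip/deflate
    # bodies, so the regex below would never match a compressed reply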
    'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive':'300',
    'Connection':'keep-alive',
    'Referer':'http://www.timezoneconverter.com/cgi-bin/tzc.tzc',
    }
tzconvert=httplib.HTTPConnection('www.timezoneconverter.com')
tzconvert.request('POST','/cgi-bin/tzc.tzc',postdata,headers)
verthtml=tzconvert.getresponse()
if verthtml.status==200 and verthtml.reason=='OK':
    srch=tzres.search(verthtml.read())
    if srch:
        print 'Right now in Jamaica, mon, it\'s %s GMT'%srch.group(1)
    else:
        print 'Conversion not found, mon, call it \'island time\'.'
else:
    print 'Bad ju-ju in the islands, mon. Bad HTML request.'
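
For Python 3, where httplib became http.client and urlencode moved to
urllib.parse, a rough port might look like the sketch below.  It is
untested against the live site; the build_request and
fetch_converted_time names, the trimmed header set, the timeout and the
ISO-8859-1 decoding are all assumptions made for illustration.

```python
import http.client
import re
from urllib.parse import urlencode

HOST = 'www.timezoneconverter.com'
PATH = '/cgi-bin/tzc.tzc'
TZRES = re.compile(r'<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>', re.M | re.I)


def build_request():
    """Return (body, headers) for the form POST, using the captured field values."""
    body = urlencode({
        'style': '1',
        'use_current_datetime': '1',
        'month': '8',
        'day': '30',
        'year': '2007',
        'time': '13:34:39',
        'time_type': '24hour',
        'fromzone': 'Jamaica',
        'tozone': 'GMT',
        'Submit.x': '92',
        'Submit.y': '12',
        'Submit': 'Convert',
    })
    # Trimmed header set (an assumption): no Accept-Encoding, so the reply
    # should come back uncompressed and searchable.
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'http://' + HOST + PATH,
    }
    return body, headers


def fetch_converted_time(timeout=10):
    """POST the form and return the matched time string, or None on failure."""
    body, headers = build_request()
    conn = http.client.HTTPConnection(HOST, timeout=timeout)
    try:
        conn.request('POST', PATH, body, headers)
        resp = conn.getresponse()
        if resp.status != 200:
            return None
        # Assumed charset; ISO-8859-1 decoding cannot fail on any byte.
        html = resp.read().decode('iso-8859-1')
    finally:
        conn.close()
    match = TZRES.search(html)
    return match.group(1) if match else None
```

Calling fetch_converted_time() does the actual request; nothing touches
the network until then.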


