[NTLUG:Discuss] Scriptable, javascript-aware web browser OR virtual operator

Leroy Tennison leroy_tennison at prodigy.net
Thu Aug 30 21:26:34 CDT 2007


Carl Haddick wrote:
>> Leroy Tennison wrote:
>>
>>  > I need to "screen scrape" a web page that is generated by filling in
>>  > a form on a previous page. I haven't had any success finding a
>>  > solution yet, so I'm hoping someone here can help.
>>
> 
> I never contribute like I should.  Please let me know if a script
> example like this is OK as an occasional offering.
> 
> The problem got me interested in re-learning http connections in Python.
> 
> First I went to www.timezoneconverter.com/cgi-bin/tzc.tzc with Firefox.
> That web page has a form which lets you convert times from one time zone
> representation to another.
> 
> I chose to convert Jamaica time, mon, to GMT, but before I hit the
> 'convert time now' button I turned on tcpdump in a root shell to capture
> the http dialog:
> 
>     tcpdump -n -s 0 -w http.dump tcp port 80
> 
> Then, I hit the convert time button.  In the root shell window, I
> control-c'd the tcpdump process and chowned http.dump to my own user.
> 
> Opening up http.dump in ethereal (newer versions are named Wireshark),
> I found the url encoded 'post' values in the first line after the
> headers.
> 
> I also just copied the headers from ethereal.  I used the 'follow tcp
> stream' under the 'analyze' pull-down menu.
> 
> Using tcpdump saved me from looking through html for the form and all
> the input fields.  Easier to copy than create.
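
To illustrate the step above, here is a minimal sketch (Python 3, not part of Carl's script) that turns a captured url-encoded body line back into field names and values. The sample body is a hypothetical, shortened capture assembled from field names that appear in the script; the helper name decode_form_body is also just for illustration.

```python
from urllib.parse import parse_qsl

# Hypothetical, shortened example of the body line seen after the headers
# in the captured TCP stream (field names borrowed from the script).
captured_body = "style=1&fromzone=Jamaica&tozone=GMT&time=13%3A34%3A39&Submit=Convert"


def decode_form_body(body):
    """Return the url-encoded form fields as a dict of name -> value."""
    # keep_blank_values=True preserves fields that were submitted empty.
    return dict(parse_qsl(body, keep_blank_values=True))


fields = decode_form_body(captured_body)
for name, value in fields.items():
    print(name, "=", value)
```

The decoded dict can then be edited and re-encoded, which is roughly what the postdata dictionary in the script does by hand.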
> 
> The headers and the url encoded values I plugged into a short Python
> script.  The work is done by the calls to httplib.HTTPConnection,
> tzconvert.request, tzconvert.getresponse, and a 'search' using a
> compiled regular expression.
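
For readers on Python 3, where httplib was renamed http.client, the same POST can be sketched with urllib.request. This is an assumption-laden sketch rather than a port of the script: build_post_request and fetch are hypothetical helper names, only two of the form fields are shown, and fetch is defined but not called so that nothing here touches the network.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL taken from the script; whether the site still answers there is not verified.
URL = "http://www.timezoneconverter.com/cgi-bin/tzc.tzc"


def build_post_request(url, fields, extra_headers=None):
    """Build (but do not send) a form POST for the given fields."""
    body = urlencode(fields).encode("ascii")  # request bodies must be bytes
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    headers.update(extra_headers or {})
    return Request(url, data=body, headers=headers, method="POST")


def fetch(request, timeout=10):
    """Send the request and return (status, body text). Not called here."""
    with urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("latin-1")


req = build_post_request(URL, {"fromzone": "Jamaica", "tozone": "GMT"})
print(req.get_method(), req.full_url)
print(req.data)
```

One caution if you copy browser headers wholesale: sending Accept-Encoding: gzip,deflate may get a compressed body back, which would have to be decompressed before any regular expression search.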
> 
> When I run this script from a command line it prints the GMT equivalent
> of the current local time, assuming local means Jamaica.  Time zone
> issues aside, it does this by surfing a web site and pulling information
> out of a returned web page.
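
The "pulling information out" step can be tried offline. The sketch below (Python 3) reuses the regular expression from the script against a made-up HTML fragment; the fragment and the helper name extract_converted_time are assumptions for illustration, since the real page layout is not shown in this thread.

```python
import re

# Same pattern as in the script, written as a raw string so \s reaches the
# regex engine unchanged.
TZ_RESULT = re.compile(
    r'<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>', re.M | re.I
)


def extract_converted_time(html):
    """Return the first bolded 'HH:MM:SS ...' string, or None if absent."""
    match = TZ_RESULT.search(html)
    return match.group(1) if match else None


# Hypothetical fragment; the actual site markup may differ.
sample_html = "<tr><td>GMT</td><td><b>18:34:39 Thursday, August 30, 2007</b></td></tr>"

print(extract_converted_time(sample_html))
print(extract_converted_time("<b>no time here</b>"))
```

A regex like this is tied to the exact markup, so it will quietly return None if the site changes its page; that is why the script has a "conversion not found" branch.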
> 
> If I were smarter I could no doubt think of far better ways, but, as Ma
> always said, it sucks to be me.
> 
> Leroy, if you need help, email me.  If the solution is complex I would
> not be available on an unpaid basis, but on the other hand I've lurked
> here for years.  Might not hurt to contribute something back.
> 
> Script follows.
> 
> Regards,
> 
> Carl
> 
> #!/usr/local/bin/python
> 
> import httplib,urllib,re
> 
> tzres=re.compile(r'<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>',re.M|re.I)
> 
> postdata=urllib.urlencode({
>     'style':'1',
>     'use_current_datetime':'1',
>     'month':'8',
>     'day':'30',
>     'year':'2007',
>     'time':'13:34:39',
>     'time_type':'24hour',
>     'fromzone':'Jamaica',
>     'tozone':'GMT',
>     'Submit.x':'92',
>     'Submit.y':'12',
>     'Submit':'Convert'
>     })
> 
> headers={
>     'Host':'www.timezoneconverter.com',
>     'User-Agent':'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051111 Firefox/1.5',
>     'Content-type':'application/x-www-form-urlencoded',
>     'Accept':'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
>     'Accept-Language':'en-us,en;q=0.5',
>     'Accept-Encoding':'gzip,deflate',
>     'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
>     'Keep-Alive':'300',
>     'Connection':'keep-alive',
>     'Referer':'http://www.timezoneconverter.com/cgi-bin/tzc.tzc',
>     }
> tzconvert=httplib.HTTPConnection('www.timezoneconverter.com')
> tzconvert.request('POST','/cgi-bin/tzc.tzc',postdata,headers)
> verthtml=tzconvert.getresponse()
> if verthtml.status==200 and verthtml.reason=='OK':
>     srch=tzres.search(verthtml.read())
>     if srch:
>         print 'Right now in Jamaica, mon, it\'s %s GMT'%srch.group(1)
>     else:
>         print 'Conversion not found, mon, call it \'island time\'.'
> else:
>     print 'Bad ju-ju in the islands, mon. Bad HTML request.'
> 
> 
Thank you, you indirectly provided something I was looking for: a site 
that expects a POST request.  I'm not sure I understand why you used 
tcpdump rather than Ethereal/Wireshark, other than to be able to copy 
something out of it.  Am I missing something?

Also, thank you for the script, I'll analyze it over the weekend.


