[NTLUG:Discuss] Scriptable, javascript-aware web browser OR virtual operator
Leroy Tennison
leroy_tennison at prodigy.net
Thu Aug 30 21:26:34 CDT 2007
Carl Haddick wrote:
>> Leroy Tennison wrote:
>>
>> > I need to "screen scrape" a generated web page which is generated by
>> > filling in a form on a previous page. I haven't had any success finding
>> > a solution yet so I'm hoping someone here can help.
>>
>
> I never contribute like I should. Please let me know if a script
> example like this is OK as an occasional offering.
>
> The problem got me interested in re-learning http connections in Python.
>
> First I went to www.timezoneconverter.com/cgi-bin/tzc.tzc with Firefox.
> That web page has a form which lets you convert times from one time zone
> representation to another.
>
> I chose to convert Jamaica time, mon, to GMT, but before I hit the
> 'convert time now' button I turned on tcpdump in a root shell to capture
> the http dialog:
>
> tcpdump -n -s 0 -w http.dump tcp port 80
>
> Then, I hit the convert time button. In the root shell window, I
> control-c'd the tcpdump process and chowned http.dump to my own user.
>
> Opening up http.dump in ethereal (newer versions are named Wireshark),
> I found the url encoded 'post' values in the first line after the
> headers.
>
> I also just copied the headers from ethereal. I used the 'follow tcp
> stream' under the 'analyze' pull-down menu.
>
> Using tcpdump saved me from looking through html for the form and all
> the input fields. Easier to copy than create.
>
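The url encoded 'post' values copied out of the capture can be turned back into name/value pairs instead of being retyped by hand. A minimal sketch, assuming Python 3 (where urllib.parse replaces the Python 2 urllib functions). The captured string below is illustrative only, built from a few of the field names in the script, and is not a verbatim capture:

```python
from urllib.parse import parse_qsl, urlencode

# Illustrative body only: a few of the form fields, as they might appear
# on the line after the request headers in 'follow tcp stream'.
captured_body = "fromzone=Jamaica&tozone=GMT&time=13%3A34%3A39&time_type=24hour"

# parse_qsl decodes the percent-escapes and returns (name, value) pairs.
fields = dict(parse_qsl(captured_body, keep_blank_values=True))

# Print the pairs in a form that can be pasted into a dict literal.
for name, value in fields.items():
    print(f"{name!r}: {value!r},")

# Round trip: urlencode turns the dict back into a body for the POST.
reencoded = urlencode(fields)
```

Keeping the decoded pairs in a dict makes it easy to change one value (say, the time zone) before re-encoding.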
> The headers and the url encoded values I plugged into a short Python
> script. The work is done by the calls to httplib.HTTPConnection,
> tzconvert.request, tzconvert.getresponse, and a 'search' using a
> compiled regular expression.
>
> When I run this script from a command line it prints GMT for local time,
> assuming local in Jamaica. Time zone issues aside, it does this by
> surfing a web site and pulling information out of a returned web page.
>
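For anyone on Python 3, the same connect / request / getresponse sequence might look roughly like the sketch below. httplib became http.client and urllib.urlencode moved to urllib.parse. The helper names build_request and post_form are made up for illustration, and the host and path are the ones used in the script. Nothing here contacts the site until post_form is actually called:

```python
import http.client
from urllib.parse import urlencode

HOST = "www.timezoneconverter.com"   # taken from the script
PATH = "/cgi-bin/tzc.tzc"            # taken from the script


def build_request(fields):
    """Return (body, headers) for a url-encoded form POST."""
    body = urlencode(fields)
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Accept-Encoding is left out on purpose: without it a server
        # normally sends an uncompressed body that can be searched as-is.
    }
    return body, headers


def post_form(fields, host=HOST, path=PATH, timeout=10):
    """POST the form and return (status, page text)."""
    body, headers = build_request(fields)
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("POST", path, body, headers)
        response = conn.getresponse()
        # The page's real charset is not known here; latin-1 maps every
        # byte to a character, so this decode cannot raise.
        return response.status, response.read().decode("latin-1")
    finally:
        conn.close()
```

Something like status, page = post_form({'fromzone': 'Jamaica', 'tozone': 'GMT'}) would then hand back the page text for searching. Whether the site accepts such a reduced set of fields is untested.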
> If I were smarter I could no doubt think of far better ways, but, as Ma
> always said, it sucks to be me.
>
> Leroy, if you need help, email me. If the solution is complex I would
> not be available on an unpaid basis, but on the other hand I've lurked
> here for years. Might not hurt to contribute something back.
>
> Script follows.
>
> Regards,
>
> Carl
>
> #!/usr/local/bin/python
>
> import httplib,urllib,re
>
> tzres=re.compile('<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>',re.M|re.I)
>
> postdata=urllib.urlencode({
> 'style':'1',
> 'use_current_datetime':'1',
> 'month':'8',
> 'day':'30',
> 'year':'2007',
> 'time':'13:34:39',
> 'time_type':'24hour',
> 'fromzone':'Jamaica',
> 'tozone':'GMT',
> 'Submit.x':'92',
> 'Submit.y':'12',
> 'Submit':'Convert'
> })
>
> headers={
> 'Host':'www.timezoneconverter.com',
> 'User-Agent':'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051111 Firefox/1.5',
> 'Content-type':'application/x-www-form-urlencoded',
> 'Accept':'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
> 'Accept-Language':'en-us,en;q=0.5',
> 'Accept-Encoding':'gzip,deflate',
> 'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
> 'Keep-Alive':'300',
> 'Connection':'keep-alive',
> 'Referer':'http://www.timezoneconverter.com/cgi-bin/tzc.tzc',
> }
> tzconvert=httplib.HTTPConnection('www.timezoneconverter.com')
> tzconvert.request('POST','/cgi-bin/tzc.tzc',postdata,headers)
> verthtml=tzconvert.getresponse()
> if verthtml.status==200 and verthtml.reason=='OK':
>     srch=tzres.search(verthtml.read())
>     if srch:
>         print 'Right now in Jamaica, mon, it\'s %s GMT'%srch.group(1)
>     else:
>         print 'Conversion not found, mon, call it \'island time\'.'
> else:
>     print 'Bad ju-ju in the islands, mon. Bad HTML request.'
>
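The scraping step can be tried without touching the network by running the same regular expression over a saved or hand-written page fragment. A small sketch, assuming Python 3. The function name extract_time and the sample HTML are made up for illustration, and the site's real markup may differ:

```python
import re

# Same pattern as the script, written as a raw string so \s is not
# treated as a string escape. re.M is dropped because the pattern has
# no ^ or $ anchors for it to affect.
TZ_RE = re.compile(r"<b>([0-9]{2}:[0-9]{2}:[0-9]{2} [0-9a-z,\s]+)</b>", re.I)


def extract_time(html):
    """Return the first bolded 'HH:MM:SS ...' text, or None if absent."""
    match = TZ_RE.search(html)
    return match.group(1) if match else None


# Made-up page fragment; not copied from the real site.
sample = "<p>Converted time: <b>18:34:39 Thursday, August 30, 2007</b></p>"
print(extract_time(sample))
print(extract_time("<p>no result here</p>"))
```

Returning None for a miss keeps the 'conversion not found' branch explicit, the same way the script checks srch before using it.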
Thank you, you indirectly provided something I was looking for: a site
that wanted a POST request. I'm not sure I understand why you used
tcpdump rather than Ethereal/Wireshark, other than to be able to copy
something out of it. Am I missing something?
Also, thank you for the script, I'll analyze it over the weekend.