Friday, September 18, 2009

Python - Back to Basics: Accessing Online Content

Many of the scripts that I write in Python deal with accessing content online. I thought it might be a good idea to go over some of the different methods Python provides, and give some examples of their everyday uses.

Most of the functionality you will ever need for accessing online content can be found in one of two modules (and often you need to use both): urllib and urllib2.

First situation: You want to download the content of some page, perhaps to scrape the page for information for later processing. You want the content to be stored in a single variable (a string) so that you can do some regular expression matching on it, or some other form of parsing. A few simple lines:


import urllib2

page_contents = urllib2.urlopen("http://www.google.com").read()

print page_contents


Notice the ".read()" method call at the end of the urlopen line. If you leave it off, you will have just a file-like response object, which you can later call ".read()" on to get the contents. I like to do it all in one line, however. Once you do this, it is a regular string and you can do whatever you want to/with it.
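
To illustrate what that file-like object is, here is the same idea with an in-memory stand-in (io.BytesIO, which is my substitute so the snippet runs without a network connection):

```python
import io

# Stand-in for what urlopen() returns: an object with a .read() method
response = io.BytesIO(b"<html>fake page</html>")

# Calling .read() turns the file-like object into a plain string
page_contents = response.read()
```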

Now, let's say that you don't want to just download the page contents - you want to download an actual file, like an image, video, PowerPoint presentation, PDF or anything else. This also takes just a few lines:


# Download the Python image from http://python.org
import urllib

# File will be called "Python-Logo.gif", and will be contained in the folder
# where the script was executed
urllib.urlretrieve("http://python.org/images/python-logo.gif", "Python-Logo.gif")


You can use the "urlretrieve" method to download pretty much any file you want - you just need to know its URL and what you want to call it when you download it. Keep in mind that if you don't specify anything but the name of the file, it will be downloaded to the folder where you run the script. You can pass something like "C:\images\Python-Logo.gif" if you want it to go somewhere else.
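
For instance, the destination path can be assembled with os.path.join instead of hand-writing separators (the folder names here are just placeholders):

```python
import os

# Hypothetical download folder - substitute your own
download_folder = os.path.join("C:\\", "images")
destination = os.path.join(download_folder, "Python-Logo.gif")
# 'destination' can then be passed as the second argument to urlretrieve
```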

Using what we have gone over in the first two examples, we can write some useful scripts. We can get the page source, find all of the PDF files linked to on it, and download them all to a folder. I do this very thing for one of the classes that I am currently taking. Here is that script:


import urllib
import re
import os

# Specify where I want to download my files
download_folder = "C:\\DropBox\\My Dropbox\\School\\CS479\\Slides\\"

# Download page contents
# Note: I am using urllib here, not urllib2, because urlretrieve (used below) only exists in urllib
# Both modules have urlopen, and for this kind of usage they behave the same
page_contents = urllib.urlopen("https://cswiki.cs.byu.edu/cs479/index.php/Lecture_slides").read()

# Use regular expression to find all pdf files on site.
# match.group(1) will contain the link to the file
# match.group(2) will contain the name of the file
for match in re.finditer(r'<a href="(.*?\.pdf)"[^>]+>([^<]+)', page_contents):
    file_url = match.group(1)
    # Remove any characters that file names cannot contain
    file_name = re.sub(r'(\\|/|:|\*|\?|"|<|>|\|)', "", match.group(2))
    # Check and see if I have already downloaded the file or not
    if not os.path.exists(download_folder + file_name + '.pdf'):
        print "Downloading new file {0}...".format(file_name)
        urllib.urlretrieve(file_url, download_folder + file_name + '.pdf')
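
The filename-sanitizing line from that script can be exercised on its own; the link text below is made up:

```python
import re

# Hypothetical link text containing characters Windows file names can't have
link_text = 'Lecture 3: "Edge Detection" <draft>'

# Same substitution used in the script: strip the forbidden characters
file_name = re.sub(r'(\\|/|:|\*|\?|"|<|>|\|)', "", link_text)
```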


Let's make things more interesting - let's say that you want to get the content of a page, but you must be logged in to get at it. You can handle this just fine as well. Let's say that you want to get into your Google Voice account and scrape the page for the ever-so-important "_rnr_se" value. (I know that I have shown this a few times before, but many people still wonder how to do it, and it is a good, practical example.)

Here are the steps we need to do to make this happen:
  1. Create an "opener" with the urllib2 module, through which all of our requests will be made. The opener will be created with an HTTPCookieProcessor that will handle all the cookies from request to request (this allows us to stay "logged in").
  2. Install the opener, so whenever we make a request, the opener we created will be used (and any cookie data received from previous requests will be sent along and updated when necessary).
  3. Prepare our login credentials, and URL-encode them.
  4. Post them to the login page.
  5. Do whatever we need once we are logged in.
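
Steps 1 and 2 can be sketched in isolation. I am passing an explicit CookieJar (HTTPCookieProcessor creates one internally if you don't) so the collected cookies can be inspected later; the try/except import is only there so the sketch also runs on Python 3:

```python
try:
    import urllib2, cookielib                # Python 2 module names
except ImportError:
    import urllib.request as urllib2         # renamed in Python 3
    import http.cookiejar as cookielib

# The jar starts empty; each request made through the opener will
# store and send cookies automatically
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)
```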
This might seem like a lot, but it really isn't, and it is very simple to do. Here is the script for that:



import urllib, urllib2
import re
from getpass import getpass

email = raw_input("Enter your Gmail username: ")
password = getpass("Enter your password: ")

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

# Set up login credentials for Google Accounts
# The 'continue' param redirects us to the Google Voice
# homepage, and gives us necessary cookie info
login_params = urllib.urlencode({
    'Email' : email,
    'Passwd' : password,
    'continue' : 'https://www.google.com/voice/account/signin'
})

# Perform the login. Cookie info sent back will be saved, so we remain logged in
# for future requests when using the opener.
# Once we log in, we will be redirected to a page that contains the _rnr_se value on it.
gv_home_page_contents = opener.open('https://www.google.com/accounts/ServiceLoginAuth', login_params).read()

# Go through the home page and grab the value for the hidden
# form field "_rnr_se", which must be included when sending texts and dealing with calls
match = re.search(r"'_rnr_se':\s*'([^']+)'", gv_home_page_contents)

if not match:
    logged_in = False
else:
    logged_in = True
    _rnr_se = match.group(1)

if logged_in:
    print "Login successful! _rnr_se value: {0}".format(_rnr_se)
else:
    print "Login was unsuccessful"



If you are looking to open up a browser and automatically post form data, look at my other post here where I go into more detail.

Hope this helps!

If you want to download all of the examples here, here is a zip-file.

Wednesday, September 2, 2009

Python + Jquery: Open Browser and POST data

A few entries ago I talked about how I used Python to run some tests on a web page that I was creating. Python has a 'webbrowser' module that can open up your default web browser and point it to a specific URL. For the tests that I was running this worked well since the page expected a GET request - all parameters were passed in the URL, which I could change dynamically in my Python source. I didn't need to POST anything to the page with an HTML form.

I wanted a way to open up pages in my web browser with Python, but perform a POST request, sending data along with it. This functionality is not supported by the webbrowser module, nor do I see how it could be. So, I came up with a solution that meets my needs. It involves jQuery and creating, then deleting, a temporary file.

For those of you who do not know, jQuery is a JavaScript library that makes programming in JavaScript downright enjoyable, and provides easy solutions to common problems. Google has a JavaScript API that lets you easily load stable versions of many different JavaScript libraries, including jQuery. You can do this wherever you would like, and it means you don't have to keep the source local (there are many benefits to this approach).

So, these are the steps we take to open a page with Python and POST our predefined (hard-coded or dynamic) form data to a page of our liking:
  1. Dynamically create a complete HTML file.
  2. Include jQuery on the page (this makes it much easier to know when the form is ready for submission).
  3. Create a form on the page with the appropriate action and method.
  4. Insert hidden form elements with their corresponding names and values.
  5. Submit the form when the DOM is finished (jQuery helps with that).
  6. Delete the file when we are done.
There are lots of ways that you could use this, but I made it as a way to test web pages I am working on. Just as an example, here is a script that you could use to open up a page after having entered your Gmail login credentials:


import os
import webbrowser

# Set up form elements - these will become the input elements in the form
input_value = {
    'Passwd' : 'YOUR PASSWORD HERE',
    'Email' : 'YOUR USER NAME',
    'continue': 'https://mail.google.com'
}
action='https://www.google.com/accounts/ServiceLoginAuth?service=mail'
method='post'

# Set up the JavaScript form-submission function.
# I am using the 'format' method on a string, and it doesn't like the { and } of the JS function,
# so I pulled it out to be inserted later
js_submit = '$(document).ready(function() {$("#form").submit(); });'

# Set up file content elements
input_field = '<input type="hidden" name="{0}" value="{1}" />'

base_file_contents = """
<script src='http://www.google.com/jsapi'></script>
<script>
google.load('jquery', '1.3.2');
</script>

<script>
{0}
</script>

<form id='form' action='{1}' method='{2}'>
{3}
</form>
"""

# Build input fields
input_fields = ""

for key, value in input_value.items():
    input_fields += input_field.format(key, value)

# Write the temporary file, open it in the browser, then clean up
with open('temp_file.html', "w") as file:
    file.write(base_file_contents.format(js_submit, action, method, input_fields))

webbrowser.open(os.path.abspath(file.name))
os.remove(os.path.abspath(file.name))
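
To sanity-check the generated markup without opening a browser, the field-building part of the script can be run with dummy values (the credentials below are placeholders, and I sort the keys only to get a stable order):

```python
input_field = '<input type="hidden" name="{0}" value="{1}" />'
input_value = {
    'Email' : 'someone@example.com',      # placeholder credentials
    'Passwd' : 'not-a-real-password',
}

# Build one hidden input element per form parameter
input_fields = ""
for key, value in sorted(input_value.items()):
    input_fields += input_field.format(key, value)
```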


Let me know if you find this useful!

Tuesday, September 1, 2009

Python - Kronos Workforce Management Clock In/Out

I have been working on the Web Team of the BYU Marriott School for almost a year now. Overall it has been a very enjoyable experience, except for one thing: Kronos. Kronos is an online time management software solution that we must use to keep track of the hours that we work - you clock in when you come in, and you clock out when you leave. Pretty simple problem with a simple solution, but they have complicated this process by providing many unnecessary levels of navigation with unintuitive and inconvenient controls.

If you have two campus jobs (which I do) then it makes matters worse, as clocking in requires additional steps. Even though I only ever clock into one job through this website, there is no way for me to tell the system I want to use that job as my default. It has hardwired itself to always use the other job as the default - there is no way to change that. I must explicitly tell it every time I clock in that I am clocking in for "Job 2". There is also no way to quickly and easily check if you are clocked in (sometimes I forget whether I clocked in). I have to go through 3 levels of navigation to find my time card and check if there is a clock-in time without a clock-out time next to it. The system won't even tell me - I have to determine my status by inspecting my time card personally.

There are many other shortcomings, but these are my biggest complaints. (I am not just a complainer - I already implemented my own time-tracking system in PHP that handles all of these problems, but unfortunately I can't use it at work for payroll purposes.)

So, I wrote a script in Python that will clock in and out of Kronos for me. It isn't too sophisticated, but it works well for my purposes. The script does not actually check whether you are clocked in or out; instead, based on the name of the file, it will perform either the clock-in or clock-out action. For example, if the name of the file is "Kronos Login.py" then it will clock you in, giving Kronos the correct job code (you hard-code that in; I made my clock-in job code "Job 2" since that is the only one I ever use). After it has done this, the script will rename its own file to "Kronos Logout.py". The next time it is run, it sees that it is named "Kronos Logout.py" and will clock out, then change its name back to "Kronos Login.py". Like I said - not sophisticated, but it solves my problems: no clumsy navigation, and I can just glance at the script on my Desktop to see my clocked-in/out status. I have been using it for a few days now without any problems.
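
The rename toggle is easy to verify as a pure function on the file name (the helper name is my own):

```python
def next_file_name(file_name):
    # After clocking in, the script renames itself to the logout version,
    # and vice versa, so the file name records the current state
    if 'Login' in file_name:
        return 'Kronos Logout.py'
    return 'Kronos Login.py'
```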

Here is the source (or download it directly):


import sys, os
import urllib2, urllib

# Set up useful variables
file_name = os.path.basename(sys.argv[0])
full_path = os.path.abspath(file_name)
current_directory = os.getcwd() + '\\'

username = 'YOUR USERNAME'
password = 'YOUR PASSWORD'

# URLS
kronos_home_page = 'https://kronprod.byu.edu/wfc/applications/suitenav/navigation.do?ESS=true'
kronos_login = 'https://kronprod.byu.edu/wfc/portal'
kronos_timestamp = 'https://kronprod.byu.edu/wfc/applications/wtk/html/ess/timestamp-record.jsp'

# If you have more than one job, set the number of the job you want to clock into here,
# e.g. '2'. (Kronos assumes job 1 by default, so you don't need to set that one.)
# Leave blank if you only have one job.
job_number = '2'

# Create our opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

login_credentials = urllib.urlencode({
    'password': password,
    'username': username
})

# Open home page to get cookies
opener.open(kronos_home_page)

# Login
opener.open(kronos_login, login_credentials)

# Clock in or clock out
if 'Logout' in file_name:
    transfer = ''  # Logging out - transfer parameter must be blank
else:
    if job_number != '':
        transfer = '////Job ' + job_number + '/'
    else:
        transfer = ''

time_stamp_parameters = urllib.urlencode({
    'transfer' : transfer
})

opener.open(kronos_timestamp, time_stamp_parameters)

if 'Login' in file_name:
    os.rename(full_path, current_directory + 'Kronos Logout.py')
else:
    os.rename(full_path, current_directory + 'Kronos Login.py')


I hope to later turn this into a Windows sidebar gadget - if I am successful, I will post that code as well.