Friday, September 18, 2009

Python - Back to Basics: Accessing Online Content

Many of the different scripts that I write in Python deal with accessing content online. I thought that it might be a good idea to go over some of the different methods Python provides, and give some examples of their everyday uses.

Most of the functionality you will ever need for accessing online content can be found in one of two modules (and often you need to use both): urllib and urllib2.

First situation: You want to download the content of some page, perhaps to scrape the page for information for later processing. You want the content to be stored in a single variable (a string) so that you can do some regular expression matching on it, or some other form of parsing. A few simple lines:


import urllib2

# Fetch the page and read its whole body into a single string
page_contents = urllib2.urlopen("http://www.google.com").read()

print page_contents


Notice the ".read()" method call at the end of the urlopen() line. If you leave this off, you will have just a file-like object, which you can call ".read()" on later to get the contents. I like to do it all in one line, however. Once you do this, the result is a regular string and you can do whatever you want to/with it.
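If you ever want that two-step approach, the file-like object also lets you look at the response metadata before reading the body. Here is a minimal sketch of that (the URL is just an example):


import urllib2

# Open the connection first, without reading the body yet
response = urllib2.urlopen("http://www.google.com")

# The file-like object exposes the response headers...
print response.info().getheader("Content-Type")

# ...and the final URL (useful if you were redirected)
print response.geturl()

# Now read the body into a regular string, just like before
page_contents = response.read()
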

Now, let's say that you don't want to just download the page contents - you want to download an actual file, like an image, a video, a PowerPoint presentation, a PDF, or anything else. Again, just a few lines:


# Download the Python image from http://python.org
import urllib

# File will be called "Python-Logo.gif", and will be contained in the folder
# where the script was executed
urllib.urlretrieve("http://python.org/images/python-logo.gif", "Python-Logo.gif")


You can use the "urlretrieve" method to download pretty much any file you want - you just need to know its URL and what you want to call it once it is downloaded. Keep in mind that if you specify nothing but the name of the file, it will be downloaded to the folder where you run the script. You can pass something like "C:\images\Python-Logo.gif" instead if you want it to go somewhere else.
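One handy extra: urlretrieve also accepts an optional third argument, a callback that is called as each block of data arrives, which you can use to show progress on larger downloads. A quick sketch, using the same Python logo as above (the callback name is just my own):


import urllib

def report_progress(block_count, block_size, total_size):
    # Called once per block; total_size will be -1 if the server
    # does not send a Content-Length header
    downloaded = block_count * block_size
    if total_size > 0:
        percent = min(100, downloaded * 100 / total_size)
        print "Downloaded {0}%".format(percent)

urllib.urlretrieve("http://python.org/images/python-logo.gif",
                   "Python-Logo.gif", report_progress)
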

Using what we have gone over in the first two examples, we can write some useful scripts. We can get the page source, find all of the PDF files linked on it, and download them all to a folder. I do this very thing for one of the classes that I am currently taking. Here is that script:


import urllib
import re
import os

# Specify where I want to download my files (note the escaped backslashes)
download_folder = "C:\\DropBox\\My Dropbox\\School\\CS479\\Slides\\"

# Download page contents
# Note: I am using urllib here rather than urllib2, since urlretrieve (used below)
# only exists in urllib. Both modules have urlopen, and for this kind of simple
# request they behave pretty much the same.
page_contents = urllib.urlopen("https://cswiki.cs.byu.edu/cs479/index.php/Lecture_slides").read()

# Use regular expression to find all pdf files on site.
# match.group(1) will contain the link to the file
# match.group(2) will contain the name of the file
for match in re.finditer(r'<a href="(.*?\.pdf)"[^>]+>([^<]+)', page_contents):
    file_url = match.group(1)
    # Remove any characters that file names cannot contain
    file_name = re.sub(r'(\\|/|:|\*|\?|"|<|>|\|)', "", match.group(2))
    # Check and see if I have already downloaded the file or not
    if not os.path.exists(download_folder + file_name + '.pdf'):
        print "Downloading new file {0}...".format(file_name)
        urllib.urlretrieve(file_url, download_folder + file_name + '.pdf')


Let's make things more interesting - let's say that you want to get the content of a page, but you must be logged in to get at it. You can handle this just fine as well. Let's say that you want to get into your Google Voice account and scrape the page for the ever-so-important "_rnr_se" value. (I know that I have shown this a few times before, but many people still wonder how to do this, and it is a good, practical example.)

Here are the steps we need to do to make this happen:
  1. Create an "opener" with the urllib2 module, through which all of our requests will be made. The opener will be created with an HTTPCookieProcessor that will handle all the cookies from request to request (this allows us to stay "logged in").
  2. Install the opener, so that whenever we make a request, the opener we created will be used (and any cookie data received from previous requests will be sent along and updated when necessary).
  3. Prepare our login credentials, and URL encode them.
  4. Post them to the login page
  5. Do whatever we need once we are logged in.
This might seem like a lot, but it really isn't, and it is very simple to do. Here is the script:



import urllib, urllib2
import re
from getpass import getpass

email = raw_input("Enter your Gmail username: ")
password = getpass("Enter your password: ")

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

# Set up login credentials for Google Accounts
# The 'continue' param redirects us to the Google Voice
# homepage, and gives us necessary cookie info
login_params = urllib.urlencode({
    'Email' : email,
    'Passwd' : password,
    'continue' : 'https://www.google.com/voice/account/signin'
})

# Perform the login. Cookie info sent back will be saved, so we remain logged in
# for future requests when using the opener.
# Once we log in, we will be redirected to a page that contains the _rnr_se value on it.
gv_home_page_contents = opener.open('https://www.google.com/accounts/ServiceLoginAuth', login_params).read()

# Go through the home page and grab the value for the hidden
# form field "_rnr_se", which must be included when sending texts and dealing with calls
match = re.search(r"'_rnr_se':\s*'([^']+)'", gv_home_page_contents)

if not match:
    logged_in = False
else:
    logged_in = True
    _rnr_se = match.group(1)

if logged_in:
    print "Login successful! _rnr_se value: {0}".format(_rnr_se)
else:
    print "Login was unsuccessful"


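Once you have the _rnr_se value, you can include it in later form posts made through the same opener. As a rough sketch only - the endpoint and parameter names below are the ones I have used in my earlier Google Voice posts, and Google could change them at any time:


# Hypothetical follow-up: send a text message through the same opener,
# reusing the _rnr_se value scraped above. The endpoint and parameter
# names may change whenever Google updates Google Voice.
sms_params = urllib.urlencode({
    '_rnr_se' : _rnr_se,
    'phoneNumber' : '+15555551234',
    'text' : 'Hello from Python!',
    'id' : ''
})
opener.open('https://www.google.com/voice/sms/send/', sms_params).read()
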

If you are looking to open up a browser and automatically post form data, look at my other post here where I go into more detail.

Hope this helps!

If you want to download all of the examples from this post, here is a zip file.