Data mining from webpages with Python Mechanize Browser Automation – a Big data tutorial

In this example the Python Mechanize package is used for browser automation. Selenium is much more feature-rich (and also somewhat harder to use); it is the better choice when mining data from JavaScript- and Ajax-heavy websites, or when building automated test cases. A Selenium browser automation example can be found here.
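
For comparison, the login part of this tutorial written with Selenium might look roughly like this. This is a minimal sketch only, assuming the pre-4.x find_element_by_name API and the same placeholder URL, field names and credentials used in the Mechanize code below:

# A rough Selenium equivalent of the Mechanize login below (sketch only)
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.tobecheckedpage")
# Fill in the same placeholder credentials used in the Mechanize example
driver.find_element_by_name("login").send_keys("myloginid")
driver.find_element_by_name("password").send_keys("mypassword")
# Submitting any element of the form submits the whole form
driver.find_element_by_name("password").submit()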

This richly commented Python Mechanize browser automation example does the following:

  • Logs in to a webpage with custom credentials
  • Opens another page, reusing the session created by the login
  • Finds a form and two of its input fields
    • Adds multiple list elements to the first field in a for loop
    • Adds multiple list elements to the second field in another, nested for loop
  • Lists the results of the above query
    • The result list can span multiple sub-pages (it shows a limited number of results and has a pager)
  • Checks all the links of the query that match a pattern
  • LABEL: PAGE – Opens all the links from the next page
    • Mines all the unfiltered data from the opened pages
    • Goes back to the LABEL: PAGE step recursively
# Import the Python Mechanize browser automation library
import mechanize
# Import regular expressions
import re

# Dropdown values for the forms
dropdownelements = ['Selection1', 'Selection2']
dropdownelements2 = ['Selection10', 'Selection20']

# Initialize the browser
browser = mechanize.Browser()
browser.set_handle_robots(False)
# Simulate Firefox by sending a Firefox User-Agent header
browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0')]
# Open the page
browser.open("http://www.tobecheckedpage")
# Select the form named login, e.g. in the HTML <form name="login">
browser.select_form(name="login")

# Find the form element named login and add myloginid to it
browser["login"] = "myloginid"
# Find the form element named password and add mypassword to it
browser["password"] = "mypassword"
# Click on the submit button
response = browser.submit()

# The main part of the code; the recursive checker checklinks() is defined below
def main():
    # For all dropdownelements
    for dropdownelement in dropdownelements:
        # For all dropdownelements2
        for dropdownelement2 in dropdownelements2:
            # Open the search page
            page = browser.open("http://www.tobecheckedpage/page")
            # Print the response to the console
            print(page.read())
            # Find the form by name, e.g. in the HTML <form name="code">
            browser.select_form(name="code")
            # Find the input element named options and set dropdownelement as its value
            browser.form['options'] = dropdownelement
            # Find the input element named keywords and set dropdownelement2 as its value
            browser.form['keywords'] = dropdownelement2
            # Submit the form
            browser.submit()
            # Call the recursive checker - now that the form has been submitted,
            # the browser holds the results of the query
            checklinks(browser, dropdownelement, dropdownelement2)

# The recursive function that checks the multiple pages of search results;
# read the main part of the code above first
def checklinks(browser, dropdownelement, dropdownelement2):
    # Materialize the link list first: following a link navigates away from
    # the results page and would otherwise break the iteration
    links = list(browser.links())
    # For all links on the search results page:
    for link in links:
        # Keep only the links that have RegExp in their href
        siteMatch = re.compile('/RegExp').search(link.url)
        # If the link matches
        if siteMatch:
            # Then open the link
            resp = browser.follow_link(link)
            # And store its content
            content = resp.get_data()
            # Do data mining on the content of the detail page here
            # Go back to the results page so the remaining links can be followed
            browser.back()
    # A second loop over the same links, this time looking for the pager
    for link in links:
        # Check if the link has the text 'Next >'
        siteMatch = re.compile('Next >').search(link.text)
        # If it has
        if siteMatch:
            # Then follow the link (go to the next page of the results)
            browser.follow_link(link)
            # Recurse: check the links of the next results page too
            checklinks(browser, dropdownelement, dropdownelement2)

# Run the main part
main()
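
The mining step itself is left as a comment inside checklinks() above. As an illustration only, a hypothetical mining step that collects every e-mail address found in a detail page could look like the sketch below; the pattern, the mine() helper and the results list are assumptions for the example, not part of the original tutorial:

import re

# Hypothetical mining step: collect e-mail addresses from a detail page
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def mine(content, results):
    # content is the raw HTML returned by resp.get_data()
    results.extend(email_pattern.findall(content))

# Usage: call mine(content, results) where the mining comment sits above
results = []
mine('<p>Contact: jobs@example.com</p>', results)
print(results)  # ['jobs@example.com']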