

Getting Links from Google: Theory vs. the World


Want to prove a bit of coding helps in the humanities? Easy!

We all use Google a lot, so what if you could store the links you get from search results? It looked like a super-easy task. The steps are simple (access the search engine, perform the search, get the results, scrape them, save the data, iterate if needed), plus there’s quite a lot of documented code around. Even better: the script we want to build would be helpful for some colleagues.

Is there any project that looks better to work on and quickly call a victory?

Spoiler: it was not that easy (hence the post).

The Basic Idea: Requests and BeautifulSoup

The project outline was easy to map:

  1. reach a search engine;
  2. query it;
  3. get the results of the query;
  4. extract all the links;
  5. save them;
  6. move to the next page;
  7. rinse and repeat.

Step 4 looks like the scariest one: we’ll have to inspect the HTML and get the right tag. But that’s part of the fun. OK, there are issues lurking here, like “how do I find out when I run out of results?”, but we can agree to scrape a fixed number of pages or even stop at the first one.

Armed with the requests and BeautifulSoup libraries (if you don’t have them, get the installation instructions here and here, respectively), we can begin our journey with some standard imports:

import requests
from bs4 import BeautifulSoup as bs

Next, we build our request to a search engine (Google here). To do that, we note that all queries on Google have a URL of the form: ‘https://www.google.com/search?q=’ + ‘something to query’. In this first sketch we don’t want to keep typing our query as an input, so we’ll hard-code it, i.e. search for ‘Goofy’.

Then we check the status of our request to make sure everything is OK when we access the page (here, Google after we’ve asked it something).

import requests
from bs4 import BeautifulSoup as bs

# search for our term with requests
searchreq = requests.get('https://www.google.com/search?q=Goofy')

# ensure it works
searchreq.raise_for_status()

If you want to input a different query every time, you may go with something like this:

import requests
from bs4 import BeautifulSoup as bs

# ask the user what to search
query = input('What do you want to search?')

# search for our term with requests
searchreq = requests.get('https://www.google.com/search?q=' + query)

# ensure it works
searchreq.raise_for_status()
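To complete the picture, here is the parsing step the next section refers to, continuing from the snippet above: we feed the page to BeautifulSoup and select the links through the “r” class, the class the results page shows when inspected in the browser (a minimal sketch; the selector comes from that manual inspection):

# parse the page we got back
soup = bs(searchreq.text, 'html.parser')

# select the result links via the "r" class seen when inspecting the page
links = soup.select('.r a')
print(links)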

Stuck: The World Strikes Back

[]

Exactly, watch that again.

[]

An empty list.

That’s the result you get. And this, well, this is really disappointing. Why is that? What’s happening? Let’s check what’s going on. Try to print our soup object (if you have IPython, use the shell). Once you have the soup object printed, try to search for our beloved “r” class, the one we are trying to select with our soup object.

You’ll see that what you were looking for is no longer there.

This is the world striking back at us. In practice, theory is not enough. So, well, now we can panic. What’s going on?

Ways Out

You start googling around. You go on Twitter and ask Al Sweigart (the author of Automate the Boring Stuff with Python, a book you should check out if you are starting out with Python).

Al was kind enough to let me know that it’s common practice for Google to obscure its results. That’s why the soup doesn’t match what we looked at. He also briefly reminded me that there’s life outside of Google, so there are chances we’d be better off searching on different search engines (he suggested DuckDuckGo).

That’s reeeeally important (hence the extra Es). We now know the cause of the problem: the HTML we see is not the same HTML we get with our request. And we already have a hint towards a solution: try asking different search engines.

We can use this new knowledge to build alternative ways out.

Rethinking the Issue

We have a new problem. The HTML that delivers our search results is partly out of our control. What can we do? This depends on how we want to fight.

1. Ways Around: Different Search Engines

The first option is to circumvent the problem: we pick a different search engine.

In practice, we go on Wikipedia and look up the names of other search engines. We then figure out how each engine builds its query URL and hope that the link-extraction phase stays the same.

Assuming this, it doesn’t look like a costly option. And we can hope one of the engines gives us the same HTML we inspect in the browser.
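As a minimal sketch of this route, here is the same request pattern pointed at DuckDuckGo’s HTML-only endpoint (the URL pattern here is an assumption to verify; every engine builds its query URL differently):

import requests

# same pattern, different engine (assumed endpoint and query parameter)
searchreq = requests.get('https://duckduckgo.com/html/?q=Goofy')
searchreq.raise_for_status()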

2. No Way(s): We Fight!

We know what we want to get. Despite the HTML tags being different, we know the links are still there. What about extracting them with regular expressions? It will be difficult and maybe sub-optimal, but rather than risking another fight with HTML obfuscation, we can tackle the issue once and for all.

We’ll write our regular expression extracting everything that looks like http-something. We can predict we will catch more links than we strictly need (menus, ads, Google’s own internal links). Still, assuming you can identify the bad links, more links than required might be better than the [] we got before.
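Here is a sketch of the regex route (the pattern is an assumption: it is deliberately greedy, so it also catches links we don’t want, and the filter at the end is just a first crude attempt):

import re
import requests

searchreq = requests.get('https://www.google.com/search?q=Goofy')
searchreq.raise_for_status()

# grab anything that looks like a URL from the raw response text
url_pattern = re.compile(r'https?://[^\s"\'<>]+')
all_links = url_pattern.findall(searchreq.text)

# crude filter (an assumption): drop Google's own internal links
external = [link for link in all_links if 'google.' not in link]
print(external)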

3. Rebuilding: from BeautifulSoup to Selenium

Maybe we can get around the HTML obfuscation and get the search results in a different way. Selenium is another popular Python library that allows us to automate our browsing.

Selenium will open the browser for us, and then we’ll have a look at the HTML. Should this fail, we may have Selenium inspect the page for us and copy and paste the inspected HTML.

This seems like something that can work in theory, but it requires some extra effort.
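Still, a minimal sketch is short enough to show here (assuming Selenium is installed and a Firefox driver such as geckodriver is on the PATH):

from selenium import webdriver

# open a real browser and run the search through it
browser = webdriver.Firefox()
browser.get('https://www.google.com/search?q=Goofy')

# page_source is the HTML as the browser rendered it; we could feed
# this back into BeautifulSoup instead of the raw requests response
rendered_html = browser.page_source
browser.quit()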

Next Steps

OK, there’s still a problem, but the field looks clearer. In the next post we will start implementing some of these ways out.


This work is carried out as part of a CAS Fellowship at CAS SEE Rijeka. See more on the Fellowship here.