Getting Links from Google: Theory vs. the World (Part II)

In the previous post we found out that Google answers our request with different HTML from the one we inspect online. We came up with the following alternatives:

  1. change the search engine;
  2. brute-force the extraction of the links;
  3. refactor the program using a different automation tool.

It’s time to explore some of these. Let’s get to work.

1. New Search Engines

A first idea is to change the search engine. We can take one of the first programs and add a little to it. We use a list of engines containing the base URL used for the search (as in the previous part) and keep the query separate. Here’s a code sample.

import requests
from bs4 import BeautifulSoup as bs

# list of search engines
engines = ['https://www.bing.com/search?q=', 'https://search.yahoo.com/?q=', 'https://duckduckgo.com/?q=']

# what we want to search (feel free to make it an input)
query = 'Goofy'

# loop on engines to search the query
for item in engines:
  searchreq = requests.get(item + query)

  # ensure it works
  searchreq.raise_for_status()

If we run this code, we get a 418 Client Error for DuckDuckGo (the ‘I’m a teapot’ error). This rules one engine out, but we can go on with the others.
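
As a side note: if we don’t want a grumpy engine to crash the whole loop, we can catch the exception that raise_for_status() throws instead of letting it stop the script. Here is a minimal sketch of that idea, using nothing beyond what requests already provides:

import requests

engines = ['https://www.bing.com/search?q=', 'https://search.yahoo.com/?q=', 'https://duckduckgo.com/?q=']
query = 'Goofy'

# keep only the engines that answer politely
working = []
for item in engines:
  searchreq = requests.get(item + query)
  try:
    # raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx, e.g. our 418
    searchreq.raise_for_status()
    working.append(item)
  except requests.exceptions.HTTPError as err:
    print(item, 'refused us:', err)

print(working)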

We know what to do from Part I: make a soup out of the responses from the first two engines, then inspect their search output. Bad news: the previous extraction code is helpless here. There’s no trace of ‘class r’. Still, all we want to find out is whether the HTML we can turn into a soup matches the HTML we can inspect online or not.

If what we see is what we get, then we can figure out the extraction pattern. Otherwise, we save time and move on. Luckily, this test only requires inspecting the soup objects. All we need is to add the soup object to our previous code (limiting the search to the first two items in the list).

import requests
from bs4 import BeautifulSoup as bs

# list of search engines
engines = ['https://www.bing.com/search?q=', 'https://search.yahoo.com/?q=', 'https://duckduckgo.com/?q=']

# what we want to search (feel free to make it an input)
query = 'Goofy'

# loop on engines to search the query (skipping duckduckgo, which sent the 418)
for item in engines[:-1]:
  searchreq = requests.get(item + query)

  # ensure it works
  searchreq.raise_for_status()

  # creating the Beautiful Soup object and printing it for inspection
  soup = bs(searchreq.text, 'html.parser')
  print(soup)

The output is terribly long, and there’s JavaScript in there: a sign we won’t be lucky. There’s no trace of that nice ‘h2 original title’, nor of the ‘li class=algo’ on Bing. On Yahoo there’s no ‘h3 class=’title’’. We are back where we started (well, along the way we found a further problem, the 418 error).
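
Instead of scrolling through pages of printed soup, we can also ask Beautiful Soup directly whether those markers survived the trip. Here is a small sketch of that check; the tag and class names are just the ones from the online inspection above, so treat them as assumptions that may break whenever the engines redesign their pages:

import requests
from bs4 import BeautifulSoup as bs

query = 'Goofy'

# tag/class markers as seen in the browser inspector (assumed, may change)
markers = {'https://www.bing.com/search?q=': ('li', 'algo'),
           'https://search.yahoo.com/?q=': ('h3', 'title')}

for url, (tag, cls) in markers.items():
  searchreq = requests.get(url + query)
  searchreq.raise_for_status()
  soup = bs(searchreq.text, 'html.parser')

  # zero matches means the served HTML differs from the one we inspected
  print(url, '->', len(soup.find_all(tag, class_=cls)), 'matches')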

So the quick fix of changing the search engine didn’t work. OK, let’s try something different: time to brute-force the link extraction.
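
To give a first taste of what ‘brute force’ means here, forget tags and classes and pull anything that looks like a URL straight out of the raw HTML with a regular expression. The pattern below is a deliberately crude sketch, not the final recipe:

import re
import requests

query = 'Goofy'
searchreq = requests.get('https://www.bing.com/search?q=' + query)
searchreq.raise_for_status()

# grab anything that looks like an absolute URL; crude on purpose
links = re.findall(r'https?://[^\s"\'<>]+', searchreq.text)

# drop the engine's own housekeeping links, keep the rest
links = [link for link in links if 'bing.com' not in link]
print(links[:10])

Crude as it is, this already yields candidate links without caring what the markup looks like; the price is noise, which is exactly what the exercises below are about.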

Further Implementations:

We tackled only a single-word query. The first exercise and improvement is to generalize our strategy to arbitrary queries, i.e. queries supplied as inputs.

Then, in order to further explore this regex procedure, we should:

  1. iterate the procedure over more than one page of results;
  2. test (and most likely improve) our code on queries of more than one word, e.g. ‘financial crisis’ or ‘python books for beginners’ (see the sketch right after this list).
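
As a down payment on the first exercise, here is a rough sketch of the input idea. The one new ingredient is quote_plus from the standard library’s urllib.parse, which turns a multi-word query such as ‘financial crisis’ into ‘financial+crisis’ so the URL stays valid; everything else is the code we already have:

import requests
from urllib.parse import quote_plus

# query supplied as input; may contain more than one word
query = input('Search for: ')

# 'financial crisis' -> 'financial+crisis'
encoded = quote_plus(query)

searchreq = requests.get('https://www.bing.com/search?q=' + encoded)
searchreq.raise_for_status()
print(searchreq.status_code)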

We’ll get there. So it looks as if there will be a Part III.


This work is carried out as part of a CAS Fellowship at CAS SEE Rijeka. See more on the Fellowship here.