Getting Links from Google: Theory vs. the World (Part III)


In the previous parts (1, 2) we mapped out the problem - scraping the links out of search results - and figured out ways around the obfuscation of the results. It's time to keep developing a regex-based solution.

Let me stress that this is intended as a brute-force exercise. Theory and practice agree that it is very hard to parse HTML efficiently with regexes. Just look here to see how bad it is.

Query Generalization

The next item on our to-do list, now that the regex-based parsing is set up, is generalizing the query. All we have to do is replace the hardcoded 'goofy' query with a variable (here: query). Like this:

#query generalized
import re
import requests
from bs4 import BeautifulSoup as bs

# detaching the query
query = 'goofy'

# search = Google search URL + query
searchreq = requests.get('https://www.google.com/search?q=' + query)
# that's the same as before...

# ensure it works
searchreq.raise_for_status()

# creating the Beautiful Soup object
soup = bs(searchreq.text, 'html.parser')

links = []
for a in soup.find_all('a', href=True):
    #print(a['href'])
    links.append(a['href'])

#print(links)

filter = []
match = 0

# here we have to change the hard-coded 'goofy' into the query variable
for item in links:
    if re.search(query, item) is not None and re.search('google', item) is None \
    and re.search('search', item) is None:
        match = match + 1
        item = item.replace('/url?q=', '')
        print(match, item, '\n')
        filter.append(item)

Now you are eager to run the program and expect to find the same results as with the hard-coded query.

You guessed it: if I put it that way, it means it's not going to happen. And you are right.

SPOILER: you'll get only one result, or maybe zero. Long story short: you'll get fewer results than with our previous implementation. Why is that?

Capitalization. In Part II we hard-coded a capitalized 'Goofy' in the query. Google, of course, is smart enough to search for both the capitalized and the uncapitalized versions of a query. But then, when refining the results, I wasn't smart enough to use the same string: in our filtering step I typed an uncapitalized 'goofy'.

As soon as you compare and contrast the two programs' outputs, you find out that this is the cause of the different results.

Hard Learnt Lesson: CAPITALIZATION mAtTers.
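To see the mismatch in isolation, here is a minimal sketch (not part of the program) showing that re.search is case-sensitive by default, and that passing re.IGNORECASE is one way around it:

import re

# the link carries the capitalized form, our filter used the lowercase one
link = '/url?q=https://en.wikipedia.org/wiki/Goofy'

print(re.search('goofy', link))                 # None: the case doesn't match
print(re.search('goofy', link, re.IGNORECASE))  # a match object: case ignored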

Query Processing

This helps us get back on track. It's time to think about the query. We learnt the hard way that the query needs to be turned into lower case. We also have to split it into its parts. Armed with these, we can then match the lower-cased parts against the links we collect, can't we? (Well… we'll see… stopwords, anyone?) Anyway, here's the code to process the query.

# %% handling query

# ask the user for the search
userquery = input('Enter your query')

# lowercase it
query = userquery.lower()

# if the query is a single word, splitting it changes nothing and we keep
# the string; if the split has more than one element we have several words,
# so we assign the resulting list to the query
if len(query.split()) > 1:
    query = query.split()

print(query)

We only need to test for queries of more than one word. (The first time I tried something like: if there's one term keep going, else do something different… which wasn't great.)

Yes, input validation is poor. Let’s hope we don’t get a zero length input.
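If you want a minimal guard against that, a sketch like this (not part of the program) keeps asking until it gets a non-empty string:

# minimal sketch: re-prompt until the user types something non-empty
userquery = input('Enter your query')
while not userquery.strip():
    userquery = input('Please enter a non-empty query')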

Further, note that we check the length of the list returned by split (remember: split returns a list). If you did something like len(query) > 1 you'd be disappointed: a single-word query has more than one letter, so that test passes anyway. Check it on one-word queries, capitalized and not, and on multi-word queries.
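Here is the difference spelled out in a quick sketch:

query = 'goofy'
print(len(query))          # 5: counts characters, not words
print(len(query.split()))  # 1: counts words

query = 'cheap python books'
print(len(query.split()))  # 3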

Ok, now it's time to include this into our former program. One extra touch: if the query is a single word we wrap it into a one-element list, so that the matching loop below iterates over words and not over single characters.

# %% general queries of multiple length

import re
import requests
from bs4 import BeautifulSoup as bs

# ask the user for the search and process the query
userquery = input('Enter your query')

# be sure to use the user query
searchreq = requests.get('https://www.google.com/search?q=' + userquery)
# ensure it works
searchreq.raise_for_status()

# lowercase it
query = userquery.lower()

# split multi-word queries into a list of words; wrap a single-word query
# into a one-element list, so the loop below iterates over words and not
# over single characters
if len(query.split()) > 1:
    query = query.split()
else:
    query = [query]

# creating the Beautiful Soup object
soup = bs(searchreq.text, 'html.parser')

links = []
for a in soup.find_all('a', href=True):
    #print(a['href'])
    links.append(a['href'])

#print(links)

filter = []
match = 0

for item in links:
    # loop on the query terms
    for q in query:
        # q is what we want to search for in every item of the links
        if re.search(q, item) is not None and re.search('google', item) is None \
        and re.search('search', item) is None:
            match = match + 1
            item = item.replace('/url?q=', '')
            print(match, item, '\n')
            filter.append(item)

Here we are matching a link as soon as just one of the search terms is found. That would be terrible practice if we were doing the searching ourselves from scratch: we want to find 'cheap beginners Python books', not 'cheap whatever' or 'python qua snakes', and so on.

Luckily, we use the full user query to perform the search, and only then apply our supernaive method to retrieve most of what Google already found for us (and made hard for us to retrieve). So Google has already selected programming books, and snakes should fall outside the scope of the results we are trying to retrieve.
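Still, if you wanted to be stricter and keep a link only when every search term appears in it, a small variant of the filtering loop (a sketch, not part of the final program) could use all():

# stricter sketch: keep a link only if ALL the query terms appear in it
for item in links:
    item = item.replace('/url?q=', '')
    if all(re.search(q, item) is not None for q in query) \
    and re.search('google', item) is None and re.search('search', item) is None:
        match = match + 1
        filter.append(item)
        print(match, item, '\n')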

While you test this new code, check that the output is relevant and that you don't get too many duplicates. Wouldn't it be great to add an item only if it's not in the filter already? Luckily, we can:

# %% general queries of multiple length, without duplicates

import re
import requests
from bs4 import BeautifulSoup as bs

# ask the user for the search and process the query
userquery = input('Enter your query')

# be sure to use the user query
searchreq = requests.get('https://www.google.com/search?q=' + userquery)
# ensure it works
searchreq.raise_for_status()

# lowercase it
query = userquery.lower()

# split multi-word queries into a list of words; wrap a single-word query
# into a one-element list, so the loop below iterates over words and not
# over single characters
if len(query.split()) > 1:
    query = query.split()
else:
    query = [query]

# creating the Beautiful Soup object
soup = bs(searchreq.text, 'html.parser')

links = []
for a in soup.find_all('a', href=True):
    #print(a['href'])
    links.append(a['href'])

#print(links)

filter = []
match = 0

for item in links:
    # strip Google's '/url?q=' prefix first, so duplicates compare equal
    item = item.replace('/url?q=', '')
    # loop on the query terms
    for q in query:
        # q is what we want to search for in every item of the links
        if re.search(q, item) is not None and re.search('google', item) is None \
        and re.search('search', item) is None and item not in filter:
            # we added the "not in filter" condition
            match = match + 1
            filter.append(item)
            print(match, item, '\n')

Cool, this works!

Of course, if you make typos you are likely to miss results: try 'pythons boks beeginners' and you'll get []. Still, if you search for 'pythons boks for beeginners', the 'for' will get you somewhere. You could compensate for this by adding a spell checker of some kind (here's Peter Norvig's famous spell checker: https://norvig.com/spell-correct.html).
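As a teaser for the to-do list below: assuming you have saved Norvig's spell-correct code as a local module (called, say, norvig_spell) together with the big.txt corpus it needs, a sketch of pre-correcting the query could look like this:

# hypothetical sketch: norvig_spell is Norvig's spell-correct.py saved locally,
# exposing his correction() function
from norvig_spell import correction

userquery = input('Enter your query')
# correct each word before handing the query to the search request above
userquery = ' '.join(correction(word) for word in userquery.lower().split())
print(userquery)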

To do

  1. saving to file
  2. consider different search engines
  3. compare search results
  4. include a spell checker

This work is carried out as part of a CAS Fellowship at CAS SEE Rijeka. See more on the Fellowship here.