  • doctorsmonsters

Find Relevant Top Hashtags Using Python - Part 2



This tutorial is built on part 1. Source code for this tutorial can be found here.


Recap

In part 1, we wrote code to retrieve tweets associated with a user-provided hashtag and then scraped them for other hashtags. In part 2, we will use Beautiful Soup to scrape Instagram posts for a hashtag.


Scraping Instagram

The good thing about this code is that you do not need to log into any Instagram account. Anyone can access publicly available posts on Instagram using the hashtag. For example, if you want to see the posts for the hashtag #newyork, you can do so by using the following URL:

https://www.instagram.com/explore/tags/newyork/

We use the above link to retrieve the posts, parse the HTML response using Beautiful Soup, and retrieve the JSON dictionary. After that, we extract all the text associated with the posts, extract the hashtags, and combine them with our Twitter hashtags list before returning it to the user.

We start by importing the required libraries.

import re
import bs4
import requests
import json

If you recall, we stored the user input in the variable “tag”, cleaned it with our “clean_input” function, and assigned the result to the “search_tag” variable. As a reminder, the code is as follows:

def clean_input(tag):
    # remove spaces, drop a leading '#', and lowercase the result
    tag = tag.replace(" ", "")
    if tag.startswith('#'):
        return tag[1:].lower()
    else:
        return tag.lower()

tag = input("Please enter your hashtag/text: ")
search_tag = clean_input(tag)
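
To make the cleaning step concrete, here is a small sketch that runs the same helper on a few sample inputs. The sample strings are made up for illustration.

```python
def clean_input(tag):
    # Same logic as the tutorial's helper: drop spaces, strip a leading '#', lowercase.
    tag = tag.replace(" ", "")
    if tag.startswith('#'):
        return tag[1:].lower()
    return tag.lower()


# Hypothetical user inputs that should all map to the same search tag.
samples = ["#NewYork", "new york", " #New York "]
cleaned = [clean_input(s) for s in samples]
print(cleaned)
```

All three inputs end up as the same lowercase tag, so the search does not depend on how the user typed the hashtag.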

Now we build the URL from the user input and then, using Beautiful Soup, parse the HTML response for further processing.

url_string = "https://www.instagram.com/explore/tags/%s/" % search_tag
response = bs4.BeautifulSoup(requests.get(url_string).text, "html.parser")
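
The two lines above assume the request always succeeds. As a hedged sketch, the request could be wrapped so that a slow or failed response raises an error instead of silently producing an empty page. The function names build_tag_url and fetch_tag_page, and the 10 second timeout, are assumptions for illustration and are not part of the original code.

```python
from urllib.parse import quote


def build_tag_url(search_tag):
    # quote() percent-encodes characters that are not safe in a URL path,
    # for example non-ASCII hashtags.
    return "https://www.instagram.com/explore/tags/%s/" % quote(search_tag)


def fetch_tag_page(search_tag, timeout=10):
    # Imported here so build_tag_url can be used without requests installed.
    import requests

    resp = requests.get(build_tag_url(search_tag), timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return resp.text


print(build_tag_url("newyork"))
```

The returned text can then be passed to bs4.BeautifulSoup exactly as in the snippet above.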

Posts in the Instagram HTML response are served as a JavaScript object. We will write a function to extract the JSON dictionary from the JavaScript. Basically, we pull the “script” tags from the soup that Beautiful Soup built from the HTML response, find the one that assigns window._sharedData, and then perform some text replacements to derive a string that is loaded into a dictionary object.

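def extract_shared_data(doc):
    # look for the script tag that assigns the page data to window._sharedData
    for script_tag in doc.find_all("script"):
        if script_tag.text.startswith("window._sharedData ="):
            # strip the JavaScript assignment and the trailing semicolon, leaving plain JSON
            shared_data = re.sub(r"^window\._sharedData = ", "", script_tag.text)
            shared_data = re.sub(r";$", "", shared_data)
            return json.loads(shared_data)
    # no matching script tag was found
    return None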
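
To see the text replacements in isolation, the sketch below applies the same two substitutions to a short script string. The sample payload is invented for illustration and is far smaller than a real response; parse_shared_data is a hypothetical name.

```python
import json
import re


def parse_shared_data(script_text):
    # Remove the JavaScript assignment prefix and the trailing semicolon,
    # which leaves a plain JSON document.
    payload = re.sub(r"^window\._sharedData = ", "", script_text)
    payload = re.sub(r";$", "", payload)
    return json.loads(payload)


# Made-up script content that mimics the shape of the assignment.
sample_script = 'window._sharedData = {"entry_data": {"TagPage": []}};'
data = parse_shared_data(sample_script)
print(data["entry_data"])
```

Once the prefix and the semicolon are gone, json.loads can read the remainder like any other JSON string.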

We will then add the following line to our get_hashtags function:

shared_data = extract_shared_data(response)

I recommend you play around with the shared_data JSON dictionary to understand its structure. Personally, I like the flexibility offered by Jupyter Notebook, which comes in very handy for something like this: you can navigate the dictionary structure by changing keys. This dictionary is very data rich, containing the number of posts attached to the provided hashtag, the URLs of the pictures, their captions, and the user ID as well as the post ID. To extract the list containing all the posts, we use the following code:

media = shared_data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']
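
If you are not working in a notebook, one way to explore a nested dictionary like this is to print an outline of its keys. The helper below is a rough sketch; the name outline and the tiny sample dictionary are assumptions for illustration, and the sample only mirrors the first few keys used above.

```python
def outline(obj, max_depth=2, indent=0):
    # Collect one line per dictionary key, descending up to max_depth levels.
    lines = []
    if isinstance(obj, dict) and max_depth > 0:
        for key, value in obj.items():
            lines.append(" " * indent + f"{key}: {type(value).__name__}")
            lines.extend(outline(value, max_depth - 1, indent + 2))
    elif isinstance(obj, list) and obj and max_depth > 0:
        # Only inspect the first element; list items usually share one shape.
        lines.extend(outline(obj[0], max_depth, indent))
    return lines


# Made-up stand-in for shared_data, much smaller than the real thing.
sample = {"entry_data": {"TagPage": [{"graphql": {}}]}}
for line in outline(sample, max_depth=3):
    print(line)
```

Raising max_depth shows more of the structure, which helps when deciding which keys to follow.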

This list contains all the posts, termed as “nodes” by Instagram. Once again, I recommend you study the structure. It is very interesting to see the amount of information stored regarding every post. We can now loop through each item of the list and retrieve the caption, as done by the following code snippet from the function get_hashtags.

captions = []
for post in media:
    if post['node']['edge_media_to_caption']['edges'] != []:
        captions.append(post['node']['edge_media_to_caption']['edges'][0]['node']['text'])
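
The loop above raises a KeyError if a post is missing one of the expected keys. As a defensive sketch, a small helper can walk the nested structure and fall back to a default value. The names dig and extract_captions and the sample posts are hypothetical.

```python
def dig(data, *path, default=None):
    # Walk nested dicts/lists, returning default when any step is missing.
    for key in path:
        try:
            data = data[key]
        except (KeyError, IndexError, TypeError):
            return default
    return data


def extract_captions(media):
    captions = []
    for post in media:
        text = dig(post, "node", "edge_media_to_caption", "edges", 0, "node", "text")
        if text:
            captions.append(text)
    return captions


# Made-up posts: one with a caption, one without.
sample_media = [
    {"node": {"edge_media_to_caption": {"edges": [{"node": {"text": "Sunset #nyc"}}]}}},
    {"node": {"edge_media_to_caption": {"edges": []}}},
]
print(extract_captions(sample_media))
```

Posts without a caption, or with an unexpected shape, are skipped instead of stopping the whole run.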

Retrieving Hashtags

From here onward, the process is simple. We now have a list containing all the captions. In part one, we wrote a function that extracts hashtags from a list of tweets.

def return_all_hashtags(tweets, tag):
    all_hashtags = []
    for tweet in tweets:
        for word in tweet.split():
            if word.startswith('#') and word.lower() != '#' + tag.lower():
                all_hashtags.append(word.lower())
    return all_hashtags

We can now pass the list of captions from Instagram to this function, which will then return a list of all hashtags in the captions.
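
Because the function splits on whitespace, a word such as “#nyc!” keeps its punctuation and “#travel#food” counts as a single hashtag. A regular expression is one possible alternative, sketched below. This is a swapped-in technique, not the approach from part 1, and the name find_hashtags and the sample captions are assumptions.

```python
import re

# One '#' followed by letters, digits, or underscores.
HASHTAG_RE = re.compile(r"#\w+")


def find_hashtags(texts, tag):
    excluded = "#" + tag.lower()
    found = []
    for text in texts:
        for match in HASHTAG_RE.findall(text):
            # Skip the hashtag we searched for; keep everything else in lowercase.
            if match.lower() != excluded:
                found.append(match.lower())
    return found


print(find_hashtags(["Love #NYC! #newyork", "#travel#food"], "newyork"))
```

The pattern stops at punctuation and at the next “#”, so adjacent hashtags are counted separately.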

The Big Finale

The code above extracts data from Instagram alone. However, to make our hashtag scraping code work for both Instagram and Twitter, we can rewrite our get_hashtags function.

def get_hashtags(tag):
    search_tag = clean_input(tag)
    tweets = tw.Cursor(api.search,
                       q='#' + search_tag,
                       lang="en").items(200)
    tweets_list = []
    for tweet in tweets:
        tweets_list.append(tweet.text)
    # retrieve instagram posts list

    url_string = "https://www.instagram.com/explore/tags/%s/" % search_tag
    response = bs4.BeautifulSoup(requests.get(url_string).text, "html.parser")

    # extract post list:
    shared_data = extract_shared_data(response)
    media = shared_data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']

    captions = []
    for post in media:
        if post['node']['edge_media_to_caption']['edges'] != []:
            captions.append(post['node']['edge_media_to_caption']['edges'][0]['node']['text'])

    # pass the cleaned tag so that the searched hashtag itself is excluded
    all_tags = return_all_hashtags(tweets_list + captions, search_tag)
    frequency = {}
    for item in set(all_tags):
        frequency[item] = all_tags.count(item)
    return {k: v for k, v in sorted(frequency.items(), key=lambda item: item[1], reverse=True)}
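
Calling all_tags.count inside the loop rescans the whole list for every distinct hashtag. As an optional sketch, collections.Counter from the standard library produces the same sorted result in a single pass. The name rank_hashtags and the sample list are made up for illustration.

```python
from collections import Counter


def rank_hashtags(all_tags):
    # most_common() returns (tag, count) pairs sorted by count, highest first.
    return dict(Counter(all_tags).most_common())


sample_tags = ["#nyc", "#travel", "#nyc", "#food", "#nyc", "#travel"]
ranked = rank_hashtags(sample_tags)
print(ranked)
```

The returned dictionary can be printed with the same loop used in the final step of this tutorial.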

Now we can add the final code that brings it all together and prints all the hashtags with their combined count from both Instagram and Twitter.

all_tags = get_hashtags(tag)
for item in all_tags:
    print(item, all_tags[item])

End-note

After working through this code, I hope you are able to retrieve tweets from Twitter and posts from Instagram, and to extract more information about those posts for use in your future projects.

I would like to thank the author of the following post, and you can refer to it as well to read more about scraping Instagram data.

