DIY Twitter API: Developing Your Own Using Web Scraping and Python

Nowadays we are seeing how social networks have been turning into toxic environments. One of the problems, from my point of view, is fake news. In this context, Twitter, thanks to its pseudo-anonymity, is the platform with the most fake news spreaders (at least in Spain). Obviously Facebook also has fake news spreaders, but it is a different platform: there you have to follow/add these spreaders to your contacts.


One of the fastest solutions users have to avoid this problem is to silence/block the accounts. To do that in the web application, you have to select the block option on each user you want to block. The problem is that this is an arduous task if you do it one by one. I wondered whether Twitter had an option to import a list of users to block, so that a script could generate a list of the users who retweeted a fake news item. And voilà, Twitter had this option, but it is no longer available... -.-



At this point, I thought about using the API, which for me is the best and fastest way. Five years ago I had used the Twitter API with the Python module Tweepy [1], but when I accessed the Twitter API panel it was very different: different prices depending on the plan you choose, some questions about why you will use the API, and more.


When I started to write the answers to get access to the API, I thought: why should I use the API if I can build my own with a bit of understanding of how the Twitter web application works and a bit of Python programming?


Understanding how the login works

The first step in understanding the login is to simulate it. For this task I used Burp Suite as a proxy in order to examine all the requests and determine which parameters (and steps) are strictly necessary, removing unnecessary parameters, cookies and other stuff. I only looked at the user/password login, not the SSO login. This is the biggest task, since the Twitter website makes a lot of connections to different hosts such as api.twitter.com, and the HTML and JS files are very big. So I performed a login to my Twitter account using Burp Suite as a proxy and captured the full login flow.

To complete all the steps, the bearer token [2], the X-Guest-Token header and the CSRF token are necessary. The bearer token and guest token can be obtained from a JS file on the Twitter website. In addition, from this JS file it is possible to get the path used to query the retweeters.


After that, I sent every step to Repeater and tried to simulate all the steps myself. Once it worked, the next step was to implement it in a script. The result should be the same as in Burp Repeater.



Below is the resulting code with the full login flow. It gives you the auth_token, guest token, csrf_token and retweeters path necessary to perform queries against the web application.

First step: get the bearer token, guest token and query paths:

import json
import re
import urllib.parse

import requests
from bs4 import BeautifulSoup

requests.packages.urllib3.disable_warnings()

user_agent = { 'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0', 'Referer' : 'https://twitter.com/sw.js' }
url_base = "https://twitter.com/home?precache=1"
r = requests.get(url_base, verify=False, headers=user_agent)
soup = BeautifulSoup(r.text, "html.parser")

# Find the main JS bundle, which contains the bearer token
js_with_bearer = ""
for i in soup.find_all('link'):
    if i.get("href").find("/main") != -1:
        js_with_bearer = i.get("href")

# The guest token is embedded in the last inline script ("gt=...")
guest_token = re.findall(r'"gt=\d{19}', str(soup.find_all('script')[-1]), re.IGNORECASE)[0].replace("\"gt=","")
print("[*] Js with Bearer token: %s" % js_with_bearer)
print("[*] Guest token: %s" % guest_token)

# Get the Bearer token from the JS bundle
r = requests.get(js_with_bearer, verify=False, headers=user_agent)
bearer = re.findall(r'",[a-z]="(.*)",[a-z]="\d{8}"', r.text, re.IGNORECASE)[0].split("\"")[-1]
print("[*] Bearer: %s" % bearer)

# GraphQL query ids for the Retweeters and Viewer operations
rt_path = re.search(r'queryId:"(.+?)",operationName:"Retweeters"', r.text).group(1)
viewer_path = re.search(r'queryId:"(.+?)",operationName:"Viewer"', r.text).group(1)
print("[*] rt_path: %s" % rt_path)
authorization_bearer = "Bearer %s" % bearer


Second step: the login flow

url_flow_1 = "https://twitter.com/i/api/1.1/onboarding/task.json?flow_name=login"
url_flow_2 = "https://twitter.com/i/api/1.1/onboarding/task.json"
# Common headers for every step of the login flow
headers = { 'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0', 'Referer' : 'https://twitter.com/sw.js', 'X-Guest-Token' : guest_token, 'Content-Type' : 'application/json', 'Authorization' : authorization_bearer }

# Flow 1: start the login flow
data = {'' : ''}
r = requests.post(url_flow_1, verify=False, headers=headers, data=json.dumps(data))
flow_token = json.loads(r.text)['flow_token']
print("[*] flow_token: %s" % flow_token)

# Flow 2: empty subtask_inputs
data = {'flow_token' : flow_token, "subtask_inputs" : []}
r = requests.post(url_flow_2, verify=False, headers=headers, data=json.dumps(data))
flow_token = json.loads(r.text)['flow_token']
print("[*] flow_token: %s" % flow_token)

# Flow 3: send the username
username = "youruser"
data = {"flow_token": flow_token ,"subtask_inputs":[{"subtask_id":"LoginEnterUserIdentifierSSOSubtask","settings_list":{"setting_responses":[{"key":"user_identifier","response_data":{"text_data":{"result":username}}}],"link":"next_link"}}]}
r = requests.post(url_flow_2, verify=False, headers=headers, data=json.dumps(data))
flow_token = json.loads(r.text)['flow_token']
print("[*] flow_token: %s" % flow_token)

if (json.loads(r.text)['subtasks'][0]['subtask_id'] == "LoginEnterAlternateIdentifierSubtask"):
    # Sometimes Twitter asks for an alternate identifier after an unusual LoginEnterUserIdentifierSSOSubtask
    email = "your@email.sometimes"
    data = {"flow_token": flow_token, "subtask_inputs":[{"subtask_id":"LoginEnterAlternateIdentifierSubtask","enter_text":{"text": email,"link":"next_link"}}]}
    r = requests.post(url_flow_2, verify=False, headers=headers, data=json.dumps(data))
    flow_token = json.loads(r.text)['flow_token']
    print("[*] flow_token: %s" % flow_token)

# Flow 4: send the password
password = "yourpassword"
data = {"flow_token": flow_token ,"subtask_inputs":[{"subtask_id":"LoginEnterPassword","enter_password":{"password":password,"link":"next_link"}}]}
r = requests.post(url_flow_2, verify=False, headers=headers, data=json.dumps(data))
flow_token = json.loads(r.text)['flow_token']
user_id = json.loads(r.text)['subtasks'][0]['check_logged_in_account']['user_id']
print("[*] flow_token: %s" % flow_token)
print("[*] user_id: %s" % user_id)

# Flow 5: account duplication check, whose response sets the auth_token cookie
data = {"flow_token":flow_token,"subtask_inputs":[{"subtask_id":"AccountDuplicationCheck","check_logged_in_account":{"link":"AccountDuplicationCheck_false"}}]}
r = requests.post(url_flow_2, verify=False, headers=headers, data=json.dumps(data))
flow_token = json.loads(r.text)['flow_token']
auth_token = r.cookies['auth_token']
print("[*] flow_token: %s" % flow_token)
print("[*] auth_token: %s" % auth_token)


Third step: the CSRF token

payload = '{"withCommunitiesMemberships":true,"withCommunitiesCreation":true,"withSuperFollowsUserFields":true}'
url_session_token = "https://twitter.com/i/api/graphql/%s/Viewer?variables=%s" % (viewer_path, urllib.parse.quote_plus(payload))
cookie = "ct0=%s; auth_token=%s" % (guest_token, auth_token)  # placeholder ct0; the response sets the real one
user_agent = { 'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0', 'Referer' : 'https://twitter.com/sw.js', 'X-Guest-Token' : guest_token, 'Content-Type' : 'application/json', 'Authorization' :  authorization_bearer, 'Cookie' : cookie  }
r = requests.get(url_session_token, verify=False, headers=user_agent)
csrf_token = r.cookies['ct0']
print("[*] CSRF token: %s" % csrf_token)


Getting the list of users who retweeted

With the auth_token and csrf_token we are able to run queries. Thus, obtaining the list of users who retweeted a tweet is easier, but it also requires understanding how the query is done. To do that, Burp Suite is used again.

After analyzing how it works, I observed that the web application only returns 100 users per request. Since all queries and responses are in JSON, the response contains, after the list of up to 100 users, a key named "cursor-bottom". This key is base64-encoded data holding the cursor over the list of users; so, to obtain the full list, it is necessary to advance the cursor on each request until the end, which is reached when the returned cursor-bottom is the same as the current cursor.





In the JSON response you can find the user id and the screen name of each user that performed the RT.

The following code returns the first 100 users who retweeted:

payload = '{"tweetId":"IDTWEET","count":100,"includePromotedContent":true,"withSuperFollowsUserFields":true,"withDownvotePerspective":false,"withReactionsMetadata":false,"withReactionsPerspective":false,"withSuperFollowsTweetFields":true,"__fs_dont_mention_me_view_api_enabled":false,"__fs_interactive_text":false,"__fs_responsive_web_uc_gql_enabled":false}'
url_rt = "https://twitter.com/i/api/graphql/%s/Retweeters?variables=%s" % (rt_path, urllib.parse.quote_plus(payload))

cookie = "ct0=%s; auth_token=%s" % (csrf_token, auth_token)
user_agent = { 'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0', 'Referer' : 'https://twitter.com/sw.js', 'x-guest-token' : guest_token , 'X-Csrf-Token' : csrf_token, 'Content-Type' : 'application/json', 'Authorization' :  authorization_bearer, 'Cookie' : cookie  }
r = requests.get(url_rt, verify=False, headers=user_agent)
message = json.loads(r.text)['data']['retweeters_timeline']['timeline']['instructions'][0]['entries']
for i in message:
    entryId = i['entryId']
    if (entryId.find("user") != -1):
        nick_user = i['content']['itemContent']['user_results']['result']['legacy']['screen_name']
        print("[*] Found: %s\t%s" % (entryId, nick_user))
    elif (entryId.find("cursor-bottom") != -1):
        next_cursor = i['content']['value']

The full code, with the loop that retrieves all the users, is in [3].
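The cursor pagination described above can be sketched as a small helper (a sketch, not the actual code from [3]; `fetch_page` is a hypothetical callable that wraps the Retweeters request above and returns the users of one page plus its cursor-bottom value):

```python
def collect_all_retweeters(fetch_page):
    """Follow cursor-bottom pagination until the cursor stops changing.

    fetch_page(cursor) -> (users, next_cursor); cursor=None asks for the
    first page. The end is reached when the same cursor is returned again.
    """
    users = []
    cursor = None
    while True:
        page_users, next_cursor = fetch_page(cursor)
        users.extend(page_users)
        if next_cursor is None or next_cursor == cursor:
            break
        cursor = next_cursor
    return users

# Dry run with a fake fetcher serving two pages
pages = {None: (["alice", "bob"], "c1"), "c1": (["carol"], "c1")}
print(collect_all_retweeters(pages.__getitem__))  # ['alice', 'bob', 'carol']
```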


Blocking users massively

As in the previous steps, with the web proxy running, go to the web application, block a user, and analyze in Burp Suite how the block is performed.

In this case it is quite simple:



And the following code performs the block action:

def blockAccount(user_id, auth_token, csrf_token, authorization_bearer):
    url_block = "https://twitter.com/i/api/1.1/blocks/create.json"
    data = "user_id=%s" % user_id
    cookie = "ct0=%s; auth_token=%s" % (csrf_token, auth_token)
    user_agent = { 'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0', 'X-Csrf-Token' : csrf_token, 'Content-Type' : 'application/x-www-form-urlencoded', 'Authorization' :  authorization_bearer, 'Cookie' : cookie  }
    r = requests.post(url_block, verify=False, headers=user_agent, data=data)
    r_id = json.loads(r.text)['id_str']
    if (r_id == user_id):
        print("[+] User blocked: %s" % r_id)


Finally, we can run this function in the loop over the users who retweeted, and they will all be blocked.
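For example, that final loop could look like this (a sketch; `block_fn` stands in for a call to the `blockAccount` function above, and the deduplication is my own addition to avoid blocking the same account twice when it appears in several retweet lists):

```python
def block_all(user_ids, block_fn):
    """Block each unique user id once, preserving order; returns the blocked ids."""
    seen = set()
    blocked = []
    for uid in user_ids:
        if uid in seen:
            continue
        seen.add(uid)
        block_fn(uid)  # e.g. blockAccount(uid, auth_token, csrf_token, authorization_bearer)
        blocked.append(uid)
    return blocked

# Dry run with a no-op block function
print(block_all(["11", "22", "11", "33"], lambda uid: None))  # ['11', '22', '33']
```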


Peace

Now we can log in to the Twitter account and verify that all the fake news spreaders have been blocked.



And you will not see any more hate speech and fake news (at least from these people... -.-')

In addition, it can be used to recreate the option to import a block list, which is no longer available in Twitter.
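A minimal way to recreate that import feature is to read the accounts to block from a plain text file (a sketch; the one-id-per-line format with `#` comments is my own assumption, and each returned id would then be passed to the `blockAccount` function above):

```python
def load_block_list(path):
    """Read one user id per line, skipping blank lines and '#' comments."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]
```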



UPDATE

Since I wrote this post, Twitter has implemented some changes. These changes and the full code of this tool can be found in the GitHub repository [3]. (03/02/2023)
 


[1] https://www.tweepy.org/

[2] https://datatracker.ietf.org/doc/html/rfc6750

[3] https://github.com/Sinkmanu/TwitterAccountBlocker

Comments (2)

Jack

Feb. 2, 2023 @ 23.21

Not working anymore.


Manu

Feb. 2, 2023 @ 23.21

Thanks for notifying me. The issue was that they modified all the JavaScript files, but now it is fixed. I do not know how long it will keep working, because it will need continuous development...

