Commit e8ba1d1e authored by Anh.Nguyen2

notebook

parent caf2aa7f
%% Cell type:code id:478e81f6 tags:
``` python
# Import elasticsearch; remember to install it first with: pip3 install elasticsearch
import elasticsearch
from elasticsearch import Elasticsearch
print(elasticsearch.VERSION)
```
%% Output
(8, 6, 0)
%% Cell type:code id:d8443460 tags:
``` python
# Connect to the Elasticsearch server running in Docker; supply the correct password and the correct path to http_ca.crt
ELASTIC_PASSWORD = "udw-afSAGmlLF7wT1DWz"
# Create the client instance
client = Elasticsearch(
    "https://localhost:9200",
    ca_certs="./http_ca.crt",
    basic_auth=("elastic", ELASTIC_PASSWORD),
)
```
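The cell above stores the password directly in the notebook. As a minimal sketch, assuming the password has been exported in an environment variable called `ELASTIC_PASSWORD` (a name chosen here for illustration, as is the helper `connection_kwargs`), the same connection arguments could be built without committing the secret:

```python
import os


def connection_kwargs(env=None, ca_path="./http_ca.crt"):
    """Build keyword arguments for Elasticsearch() from the environment.

    ELASTIC_PASSWORD is an assumed variable name; the notebook itself
    does not define it.
    """
    env = os.environ if env is None else env
    password = env.get("ELASTIC_PASSWORD")
    if not password:
        raise RuntimeError("Set ELASTIC_PASSWORD before connecting")
    return {
        "hosts": "https://localhost:9200",
        "ca_certs": ca_path,
        "basic_auth": ("elastic", password),
    }


# Usage (needs the Docker server from the cell above to be running):
# client = Elasticsearch(**connection_kwargs())
```

Passing a plain dictionary as `env` keeps the helper easy to check without touching the real environment.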
%% Cell type:code id:e28dace4 tags:
``` python
# Test the connection
client.info()
```
%% Output
ObjectApiResponse({'name': 'cc70c57dc129', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'RuNPQIiSSWCF-y3IQSYbXA', 'version': {'number': '8.6.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'f67ef2df40237445caa70e2fef79471cc608d70d', 'build_date': '2023-01-04T09:35:21.782467981Z', 'build_snapshot': False, 'lucene_version': '9.4.2', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})
%% Cell type:code id:851246c7 tags:
``` python
import pandas as pd
from pandas import json_normalize
```
%% Cell type:code id:54d5c1f3 tags:
``` python
# Load the raw tweets (one JSON object per line) and keep only the English ones ("en")
raw_tweets = pd.read_json(r'./farmers-protest-tweets-2021-2-4.json', lines=True)
raw_tweets = raw_tweets[raw_tweets['lang']=='en']
```
%% Output
Shape: (48429, 21)
url \
0 https://twitter.com/ArjunSinghPanam/status/136...
1 https://twitter.com/PrdeepNain/status/13645062...
3 https://twitter.com/anmoldhaliwal/status/13645...
8 https://twitter.com/anmoldhaliwal/status/13645...
11 https://twitter.com/anmoldhaliwal/status/13645...
date \
0 2021-02-24 09:23:35+00:00
1 2021-02-24 09:23:32+00:00
3 2021-02-24 09:23:16+00:00
8 2021-02-24 09:22:34+00:00
11 2021-02-24 09:21:51+00:00
content \
0 The world progresses while the Indian police a...
1 #FarmersProtest \n#ModiIgnoringFarmersDeaths \...
3 @ReallySwara @rohini_sgh watch full video here...
8 @mandeeppunia1 watch full video here https://t...
11 @mandeeppunia1 watch full video here https://t...
renderedContent id \
0 The world progresses while the Indian police a... 1364506249291784198
1 #FarmersProtest \n#ModiIgnoringFarmersDeaths \... 1364506237451313155
3 @ReallySwara @rohini_sgh watch full video here... 1364506167226032128
8 @mandeeppunia1 watch full video here youtu.be/... 1364505991887347714
11 @mandeeppunia1 watch full video here youtu.be/... 1364505813834989568
user \
0 {'username': 'ArjunSinghPanam', 'displayname':...
1 {'username': 'PrdeepNain', 'displayname': 'Pra...
3 {'username': 'anmoldhaliwal', 'displayname': '...
8 {'username': 'anmoldhaliwal', 'displayname': '...
11 {'username': 'anmoldhaliwal', 'displayname': '...
outlinks \
0 [https://twitter.com/ravisinghka/status/136415...
1 []
3 [https://youtu.be/-bUKumwq-J8]
8 [https://youtu.be/-bUKumwq-J8]
11 [https://youtu.be/-bUKumwq-J8]
tcooutlinks replyCount retweetCount ... quoteCount \
0 [https://t.co/es3kn0IQAF] 0 0 ... 0
1 [] 0 0 ... 0
3 [https://t.co/wBPNdJdB0n] 0 0 ... 0
8 [https://t.co/wBPNdJdB0n] 0 0 ... 0
11 [https://t.co/wBPNdJdB0n] 0 0 ... 0
conversationId lang \
0 1364506249291784198 en
1 1364506237451313155 en
3 1364350947099484160 en
8 1364428985074032646 en
11 1364480983995584515 en
source \
0 <a href="http://twitter.com/download/iphone" r...
1 <a href="http://twitter.com/download/android" ...
3 <a href="https://mobile.twitter.com" rel="nofo...
8 <a href="https://mobile.twitter.com" rel="nofo...
11 <a href="https://mobile.twitter.com" rel="nofo...
sourceUrl sourceLabel \
0 http://twitter.com/download/iphone Twitter for iPhone
1 http://twitter.com/download/android Twitter for Android
3 https://mobile.twitter.com Twitter Web App
8 https://mobile.twitter.com Twitter Web App
11 https://mobile.twitter.com Twitter Web App
media retweetedTweet \
0 None NaN
1 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
3 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
8 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
11 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
quotedTweet \
0 {'url': 'https://twitter.com/RaviSinghKA/statu...
1 None
3 None
8 None
11 None
mentionedUsers
0 [{'username': 'narendramodi', 'displayname': '...
1 [{'username': 'Kisanektamorcha', 'displayname'...
3 [{'username': 'ReallySwara', 'displayname': 'S...
8 [{'username': 'mandeeppunia1', 'displayname': ...
11 [{'username': 'mandeeppunia1', 'displayname': ...
[5 rows x 21 columns]
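To make the loading step concrete, here is a rough standard-library sketch of what reading a JSON Lines file and filtering on `lang` amounts to. The helper `read_json_lines` and the three sample records are invented for illustration; the notebook itself relies on `pd.read_json(..., lines=True)`.

```python
import io
import json


def read_json_lines(handle, lang="en"):
    """Yield one dict per non-empty line, keeping only records in `lang`."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("lang") == lang:
            yield record


# Made-up sample in the same one-object-per-line layout as the tweet file.
sample = io.StringIO(
    '{"id": 1, "lang": "en", "content": "hello"}\n'
    '{"id": 2, "lang": "hi", "content": "namaste"}\n'
    '{"id": 3, "lang": "en", "content": "world"}\n'
)
english = list(read_json_lines(sample))
print([record["id"] for record in english])
```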
%% Cell type:code id:4ad2a607 tags:
``` python
# Flatten the nested user records and drop duplicate users
users = json_normalize(raw_tweets['user'])
users = users.rename(columns={'id': 'userId'})
users = users.drop_duplicates(subset=['userId'])
```
%% Output
Shape: (12407, 21)
username displayname userId \
0 ArjunSinghPanam Arjun Singh Panam 45091142
1 PrdeepNain Pradeep Nain 1355092620662329349
2 anmoldhaliwal Anmol 137908912
5 ShariaActivist Sharia Ali Siddique 1362487487747121152
6 KaurDosanjh1979 Red 💚 538638801
description \
0 Global Citizen, Actor, Director: Sky is the ro...
1 Live in the sunshine where you belong
2 coming soon
5 Little Climate & Environmental Activist | Foun...
6
rawDescription descriptionUrls \
0 Global Citizen, Actor, Director: Sky is the ro... []
1 Live in the sunshine where you belong []
2 coming soon []
5 Little Climate & Environmental Activist | Foun... []
6 []
verified created followersCount friendsCount ... \
0 False 2009-06-06T07:50:57+00:00 603 311 ...
1 False 2021-01-29T09:58:06+00:00 14 134 ...
2 False 2010-04-28T03:12:18+00:00 51 27 ...
5 False 2021-02-18T19:41:57+00:00 46 106 ...
6 False 2012-03-27T23:14:32+00:00 427 1005 ...
favouritesCount listedCount mediaCount location protected \
0 4269 23 1211 False
1 240 0 102 False
2 77 0 12 Brampton, On False
5 60 0 53 she/they False
6 18962 0 30 False
linkUrl linkTcourl \
0 https://www.cosmosmovieofficial.com https://t.co/3uaoV3gCt3
1 None None
2 None None
5 None None
6 None None
profileImageUrl \
0 https://pbs.twimg.com/profile_images/121554174...
1 https://pbs.twimg.com/profile_images/136417063...
2 https://pbs.twimg.com/profile_images/156497514...
5 https://pbs.twimg.com/profile_images/136428288...
6 https://pbs.twimg.com/profile_images/135582023...
profileBannerUrl \
0 https://pbs.twimg.com/profile_banners/45091142...
1 https://pbs.twimg.com/profile_banners/13550926...
2 None
5 https://pbs.twimg.com/profile_banners/13624874...
6 https://pbs.twimg.com/profile_banners/53863880...
url
0 https://twitter.com/ArjunSinghPanam
1 https://twitter.com/PrdeepNain
2 https://twitter.com/anmoldhaliwal
5 https://twitter.com/ShariaActivist
6 https://twitter.com/KaurDosanjh1979
[5 rows x 21 columns]
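The flatten-then-deduplicate step can be pictured in plain Python. This is only a sketch of the idea, not pandas' implementation: `flatten` and `unique_by` are hypothetical helpers, and the sample users are invented. It mirrors two defaults of the pandas calls: nested keys joined with `.`, and the first occurrence kept when duplicates are dropped.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def unique_by(records, key):
    """Keep the first record seen for each value of `key`."""
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())


# Invented sample: the same user appears on two tweets.
raw_users = [
    {"id": 1, "username": "a", "profile": {"location": "Pune, India"}},
    {"id": 2, "username": "b", "profile": {"location": ""}},
    {"id": 1, "username": "a", "profile": {"location": "Pune, India"}},
]
flat_users = unique_by([flatten(user) for user in raw_users], "id")
print(flat_users)
```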
%% Cell type:code id:c30e443b tags:
``` python
# Extract the author id, keep the useful columns, rename them and drop duplicates
raw_tweets['userId'] = [user['id'] for user in raw_tweets['user']]
# Keep only the useful columns
cols = ['id', 'userId', 'date', 'renderedContent', 'likeCount']
# .copy() makes tweets an independent DataFrame, so the steps below do not
# trigger pandas' SettingWithCopyWarning
tweets = raw_tweets[cols].copy()
# 'url' is not among the selected columns, so only 'id' needs renaming
tweets = tweets.rename(columns={'id': 'tweetId'})
tweets = tweets.drop_duplicates(subset=['tweetId'])
tweets.head(5)
```
%% Output
C:\Users\ngoch\anaconda3\lib\site-packages\pandas\core\frame.py:5039: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().rename(
tweetId userId date \
0 1364506249291784198 45091142 2021-02-24 09:23:35+00:00
1 1364506237451313155 1355092620662329349 2021-02-24 09:23:32+00:00
3 1364506167226032128 137908912 2021-02-24 09:23:16+00:00
8 1364505991887347714 137908912 2021-02-24 09:22:34+00:00
11 1364505813834989568 137908912 2021-02-24 09:21:51+00:00
renderedContent likeCount
0 The world progresses while the Indian police a... 0
1 #FarmersProtest \n#ModiIgnoringFarmersDeaths \... 0
3 @ReallySwara @rohini_sgh watch full video here... 0
8 @mandeeppunia1 watch full video here youtu.be/... 0
11 @mandeeppunia1 watch full video here youtu.be/... 0
%% Cell type:code id:be2ad904 tags:
``` python
users = users[['userId','location']]
```
%% Output
userId location
0 45091142
1 1355092620662329349
2 137908912 Brampton, On
5 1362487487747121152 she/they
6 538638801
... ... ...
48409 714731566828793857 Mumbai, India
48413 1214383604752478208
48416 482737576 india
48417 57621609 Pune, India
48423 1355780354091720704
[12407 rows x 2 columns]
%% Cell type:code id:746551d9 tags:
``` python
# Final table: attach each author's location to their tweets
tweets = pd.merge(tweets, users, on='userId')
```
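Because every tweet should match exactly one user row, the merge is a many-to-one join. As a small sketch on invented data (the `_demo` frames are not part of the notebook), pandas can be asked to verify that assumption with `validate`, and `how="left"` keeps tweets even if a user were missing:

```python
import pandas as pd

# Invented miniature versions of the two tables.
tweets_demo = pd.DataFrame({"tweetId": [10, 11, 12], "userId": [1, 1, 2]})
users_demo = pd.DataFrame({"userId": [1, 2], "location": ["", "Pune, India"]})

# validate="many_to_one" raises MergeError if userId is duplicated in users_demo.
merged = pd.merge(
    tweets_demo,
    users_demo,
    on="userId",
    how="left",
    validate="many_to_one",
)
print(merged)
```

The row count of the result matching the tweet table, as it does in the notebook (48429 rows), is a quick sign that no tweet was dropped or duplicated by the join.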
%% Cell type:code id:6366dee6 tags:
``` python
print(tweets)
```
%% Output
tweetId userId date \
0 1364506249291784198 45091142 2021-02-24 09:23:35+00:00
1 1364154888666644481 45091142 2021-02-23 10:07:24+00:00
2 1364111785486413828 45091142 2021-02-23 07:16:08+00:00
3 1364109014393626629 45091142 2021-02-23 07:05:07+00:00
4 1364000091208622082 45091142 2021-02-22 23:52:18+00:00
... ... ... ...
48424 1360040890438283264 714731566828793857 2021-02-12 01:39:51+00:00
48425 1360040704186064896 1214383604752478208 2021-02-12 01:39:06+00:00
48426 1360040623881940995 482737576 2021-02-12 01:38:47+00:00
48427 1360040616541884418 57621609 2021-02-12 01:38:45+00:00
48428 1360040267806605316 1355780354091720704 2021-02-12 01:37:22+00:00
renderedContent likeCount \
0 The world progresses while the Indian police a... 0
1 An honest piece to counteract the attempt to d... 0
2 Today hashtag for the #farmersprotest is \n\n#... 0
3 Humanity against tyranny. \n\n#FarmersProtest ... 1
4 After months of protests, over 200 deaths, 11 ... 1
... ... ...
48424 Farmers at the Singhu border have started stre... 7
48425 @sachin_rt youtu.be/smjfGmCn7x0\nHope we dont ... 0
48426 #farmersProtest getting mommentum from differ... 3
48427 if you have some interest in understanding #Fa... 4
48428 @BeingSalmanKhan Salmon Khan, wear bangles and... 0
location
0
1
2
3
4
... ...
48424 Mumbai, India
48425
48426 india
48427 Pune, India
48428
[48429 rows x 6 columns]
%% Cell type:code id:0888a459 tags:
``` python
#Create index
client.indices.create(index='farmersprotest')
```
%% Output
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'farmersprotest'})
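The index above is created without a mapping, so Elasticsearch infers field types when the first documents arrive. A possible alternative, sketched here under assumptions, is to declare the types up front. The field names follow the injection cell in this notebook; the chosen types and the helper `build_mappings` are illustrative guesses, not something the notebook specifies.

```python
def build_mappings():
    """Explicit field types for the tweet documents (assumed, not from the notebook)."""
    return {
        "properties": {
            "userId": {"type": "long"},       # Twitter ids fit in a 64-bit integer
            "_doc": {"type": "text"},         # tweet text, analysed for full-text search
            "date": {"type": "date"},         # ISO 8601 timestamps
            "location": {"type": "keyword"},  # free-form profile location, exact match
            "like": {"type": "integer"},
        }
    }


# Usage on a fresh index (creation fails if the index already exists):
# client.indices.create(index="farmersprotest", mappings=build_mappings())
```

With dynamic mapping, string fields typically also get a `keyword` sub-field that skips long values, which would explain the `'_ignored': ['_doc.keyword']` entries visible in the search output; declaring `_doc` as plain `text` should avoid that sub-field.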
%% Cell type:code id:db8686c0 tags:
``` python
# Inject the data in batches with the bulk helper
from elasticsearch import helpers

total_number = len(tweets)      # 48429 rows in the final table
n = 100                         # target number of batches
batch_size = total_number // n  # rows per batch
c = 0
while c < total_number:
    # Build one batch of actions, stopping early on the last, shorter batch
    actions = []
    j = 0
    while j < batch_size and c < total_number:
        action = {
            "_index": "farmersprotest",
            "_id": tweets['tweetId'].iloc[c],
            "userId": tweets['userId'].iloc[c],
            "_doc": tweets['renderedContent'].iloc[c],
            "date": tweets['date'].iloc[c],
            "location": tweets['location'].iloc[c],
            "like": tweets["likeCount"].iloc[c]
        }
        actions.append(action)
        j += 1
        c += 1
    # Send the batch to Elasticsearch
    helpers.bulk(client, actions)
```
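The manual batching above can also be expressed with a generator. This is a sketch rather than a drop-in replacement: `generate_actions` and `chunked` are hypothetical helpers and the two sample rows are invented. It keeps the same field names as the cell above but nests them under `_source`, which the bulk helper also accepts.

```python
from itertools import islice


def generate_actions(rows, index_name="farmersprotest"):
    """Yield one bulk action per row (rows are dicts keyed by column name)."""
    for row in rows:
        yield {
            "_index": index_name,
            "_id": row["tweetId"],
            "_source": {
                "userId": row["userId"],
                "_doc": row["renderedContent"],
                "date": row["date"],
                "location": row["location"],
                "like": row["likeCount"],
            },
        }


def chunked(iterable, size):
    """Split any iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Invented rows shaped like tweets.to_dict("records").
rows = [
    {"tweetId": 10, "userId": 1, "renderedContent": "first",
     "date": "2021-02-24T09:23:35+00:00", "location": "", "likeCount": 0},
    {"tweetId": 11, "userId": 2, "renderedContent": "second",
     "date": "2021-02-23T10:07:24+00:00", "location": "Pune, India", "likeCount": 3},
]
actions = list(generate_actions(rows))

# Usage with a live client; helpers.bulk batches the iterable itself:
# helpers.bulk(client, generate_actions(tweets.to_dict("records")), chunk_size=500)
```

Since `helpers.bulk` already splits its input according to `chunk_size`, `chunked` is shown only to illustrate what the nested `while` loops do by hand.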
%% Cell type:code id:f4213a8f tags:
``` python
# Pass the query as a keyword argument; the 'body' parameter is deprecated in the 8.x client
client.search(index='farmersprotest', query={"match_all": {}})
```
%% Output
C:\Users\ngoch\AppData\Local\Temp/ipykernel_26524/3406525718.py:1: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. Instead use individual parameters.
client.search(index='farmersprotest',body={"query": {"match_all": {}}})
ObjectApiResponse({'took': 824, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 10000, 'relation': 'gte'}, 'max_score': 1.0, 'hits': [{'_index': 'farmersprotest', '_id': '1364506249291784198', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': 'The world progresses while the Indian police and Govt are still trying to take India back to the horrific past through its tyranny. \n\n@narendramodi @DelhiPolice Shame on you. \n\n#ModiDontSellFarmers \n#FarmersProtest \n#FreeNodeepKaur twitter.com/ravisinghka/st…', 'date': '2021-02-24T09:23:35+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1364154888666644481', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'An honest piece to counteract the attempt to dehumanise the protesting Sikhs in Denmark and Europe. \n\n#farmersprotest twitter.com/sikharchive/st…', 'date': '2021-02-23T10:07:24+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1364111785486413828', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Today hashtag for the #farmersprotest is \n\n#Pagdi_Sambhal_Jatta twitter.com/kisanektamorch…', 'date': '2021-02-23T07:16:08+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1364109014393626629', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Humanity against tyranny. \n\n#FarmersProtest \n#humanrights twitter.com/ravisinghka/st…', 'date': '2021-02-23T07:05:07+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1364000091208622082', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': 'After months of protests, over 200 deaths, 11 meetings. \n\n@nstomar still hasn’t figured out the issue. \n\nSir take the farmer’s offer of an open public debate and you’ll be the most informed, then you’ll be first in line to get the laws repealed. 
\n\n#farmersprotest #freenodeepkaur https://t.co/lMyE2E1kLp', 'date': '2021-02-22T23:52:18+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1363997412424040448', '_score': 1.0, '_source': {'userId': 45091142, '_doc': '22/02/21 Updates from @jagjitvaheguru 🙏🏽\n\n#farmersprotest twitter.com/jagjitvaheguru…', 'date': '2021-02-22T23:41:39+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1363760114969288708', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Humanity against tyranny. \n\n#FarmersProtest twitter.com/reutersindia/s…', 'date': '2021-02-22T07:58:43+00:00', 'location': '', 'like': 2}}, {'_index': 'farmersprotest', '_id': '1363753090181197824', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': 'Not just a few mislead people. \n\n@narendramodi These are the people who put you in the position of governing the country. \n\nThey didn’t put you there to be ignored, violently attacked, sexually abused or have their food and water cut off by your @DelhiPolice\n\n#farmersprotest twitter.com/reuters/status…', 'date': '2021-02-22T07:30:48+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1363613574317408266', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Today’s update 21/02/21: \n\nThank you @jagjitvaheguru 🙏🏽\n\n#FarmersProtest \n#MSPLawForAllCrops \n#ReleaseNodeepKaur twitter.com/jagjitvaheguru…', 'date': '2021-02-21T22:16:25+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1363029589694574594', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': '@MankamalSingh @Punjab2000music @SikhPA @minkaur5 @SCUKofficial @SikhFedUK @sikhsinscotland @DrJasjitSingh Their response completely ignores the fact that Geeta cuts the speaker off and says, \n“it’s not propaganda.”\n\nIn response to @DrSharandeep stating that the govt of Indian is spreading propaganda regarding the farmer’s protests being 
hijacked. \n\n@BBCNews \n\n#FarmersProtest https://t.co/Tp5A3ajVuE', 'date': '2021-02-20T07:35:52+00:00', 'location': '', 'like': 3}}]}})
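The response above is a nested dictionary, so the useful parts can be pulled out with ordinary indexing. A small sketch follows, using a trimmed copy of two hits from the output; the helper `summarize` is hypothetical.

```python
def summarize(response):
    """Return (total, is_exact, [(tweet id, like count), ...]) from a search response."""
    hits = response["hits"]
    total = hits["total"]
    is_exact = total["relation"] == "eq"  # "gte" means the value is only a lower bound
    rows = [(hit["_id"], hit["_source"]["like"]) for hit in hits["hits"]]
    return total["value"], is_exact, rows


# Trimmed copy of the response shown above (two of the ten hits, text omitted).
response = {
    "hits": {
        "total": {"value": 10000, "relation": "gte"},
        "hits": [
            {"_id": "1364506249291784198", "_source": {"userId": 45091142, "like": 0}},
            {"_id": "1364109014393626629", "_source": {"userId": 45091142, "like": 1}},
        ],
    }
}
total, is_exact, rows = summarize(response)
print(total, is_exact, rows)
```

Two details of the output are worth noting. Only ten hits come back because that is the default page size of a search, and the total is reported as `10000` with relation `gte` because hit counting stops at that threshold by default. Passing `track_total_hits=True` to `client.search` should return the exact count.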
%% Cell type:code id:e2b8835e tags:
``` python
```