Compare revisions

  • Anh.Nguyen2/sffs_elasticsearch
Commits on Source (5)
%% Cell type:code id:478e81f6 tags:
``` python
# Import elasticsearch; remember to install it first with: pip3 install elasticsearch
import elasticsearch
from elasticsearch import Elasticsearch
print(elasticsearch.VERSION)
```
%% Output
(8, 6, 0)
%% Cell type:code id:d8443460 tags:
``` python
# Connect to the Elasticsearch server running in Docker; supply the correct password and the correct path to http_ca.crt
ELASTIC_PASSWORD = "udw-afSAGmlLF7wT1DWz"
# Create the client instance
client = Elasticsearch(
    "https://localhost:9200",
    ca_certs="./http_ca.crt",
    basic_auth=("elastic", ELASTIC_PASSWORD)
)
```
%% Cell type:code id:e28dace4 tags:
``` python
# Test the connection
client.info()
```
%% Output
ObjectApiResponse({'name': 'cc70c57dc129', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'RuNPQIiSSWCF-y3IQSYbXA', 'version': {'number': '8.6.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'f67ef2df40237445caa70e2fef79471cc608d70d', 'build_date': '2023-01-04T09:35:21.782467981Z', 'build_snapshot': False, 'lucene_version': '9.4.2', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})
%% Cell type:code id:851246c7 tags:
``` python
import pandas as pd
from pandas import json_normalize
```
%% Cell type:code id:54d5c1f3 tags:
``` python
#Get raw_tweets, filter by "en"
raw_tweets = pd.read_json(r'./farmers-protest-tweets-2021-2-4.json', lines=True)
raw_tweets = raw_tweets[raw_tweets['lang']=='en']
```
%% Output
Shape: (48429, 21)
url \
0 https://twitter.com/ArjunSinghPanam/status/136...
1 https://twitter.com/PrdeepNain/status/13645062...
3 https://twitter.com/anmoldhaliwal/status/13645...
8 https://twitter.com/anmoldhaliwal/status/13645...
11 https://twitter.com/anmoldhaliwal/status/13645...
date \
0 2021-02-24 09:23:35+00:00
1 2021-02-24 09:23:32+00:00
3 2021-02-24 09:23:16+00:00
8 2021-02-24 09:22:34+00:00
11 2021-02-24 09:21:51+00:00
content \
0 The world progresses while the Indian police a...
1 #FarmersProtest \n#ModiIgnoringFarmersDeaths \...
3 @ReallySwara @rohini_sgh watch full video here...
8 @mandeeppunia1 watch full video here https://t...
11 @mandeeppunia1 watch full video here https://t...
renderedContent id \
0 The world progresses while the Indian police a... 1364506249291784198
1 #FarmersProtest \n#ModiIgnoringFarmersDeaths \... 1364506237451313155
3 @ReallySwara @rohini_sgh watch full video here... 1364506167226032128
8 @mandeeppunia1 watch full video here youtu.be/... 1364505991887347714
11 @mandeeppunia1 watch full video here youtu.be/... 1364505813834989568
user \
0 {'username': 'ArjunSinghPanam', 'displayname':...
1 {'username': 'PrdeepNain', 'displayname': 'Pra...
3 {'username': 'anmoldhaliwal', 'displayname': '...
8 {'username': 'anmoldhaliwal', 'displayname': '...
11 {'username': 'anmoldhaliwal', 'displayname': '...
outlinks \
0 [https://twitter.com/ravisinghka/status/136415...
1 []
3 [https://youtu.be/-bUKumwq-J8]
8 [https://youtu.be/-bUKumwq-J8]
11 [https://youtu.be/-bUKumwq-J8]
tcooutlinks replyCount retweetCount ... quoteCount \
0 [https://t.co/es3kn0IQAF] 0 0 ... 0
1 [] 0 0 ... 0
3 [https://t.co/wBPNdJdB0n] 0 0 ... 0
8 [https://t.co/wBPNdJdB0n] 0 0 ... 0
11 [https://t.co/wBPNdJdB0n] 0 0 ... 0
conversationId lang \
0 1364506249291784198 en
1 1364506237451313155 en
3 1364350947099484160 en
8 1364428985074032646 en
11 1364480983995584515 en
source \
0 <a href="http://twitter.com/download/iphone" r...
1 <a href="http://twitter.com/download/android" ...
3 <a href="https://mobile.twitter.com" rel="nofo...
8 <a href="https://mobile.twitter.com" rel="nofo...
11 <a href="https://mobile.twitter.com" rel="nofo...
sourceUrl sourceLabel \
0 http://twitter.com/download/iphone Twitter for iPhone
1 http://twitter.com/download/android Twitter for Android
3 https://mobile.twitter.com Twitter Web App
8 https://mobile.twitter.com Twitter Web App
11 https://mobile.twitter.com Twitter Web App
media retweetedTweet \
0 None NaN
1 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
3 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
8 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
11 [{'thumbnailUrl': 'https://pbs.twimg.com/ext_t... NaN
quotedTweet \
0 {'url': 'https://twitter.com/RaviSinghKA/statu...
1 None
3 None
8 None
11 None
mentionedUsers
0 [{'username': 'narendramodi', 'displayname': '...
1 [{'username': 'Kisanektamorcha', 'displayname'...
3 [{'username': 'ReallySwara', 'displayname': 'S...
8 [{'username': 'mandeeppunia1', 'displayname': ...
11 [{'username': 'mandeeppunia1', 'displayname': ...
[5 rows x 21 columns]
%% Cell type:code id:4ad2a607 tags:
``` python
# Unnest the users and drop duplicates
users = json_normalize(raw_tweets['user'])
users.rename(columns={'id':'userId'}, inplace=True)
users.drop_duplicates(subset=['userId'], inplace=True)
users = pd.DataFrame(users)
```
%% Output
Shape: (12407, 21)
username displayname userId \
0 ArjunSinghPanam Arjun Singh Panam 45091142
1 PrdeepNain Pradeep Nain 1355092620662329349
2 anmoldhaliwal Anmol 137908912
5 ShariaActivist Sharia Ali Siddique 1362487487747121152
6 KaurDosanjh1979 Red 💚 538638801
description \
0 Global Citizen, Actor, Director: Sky is the ro...
1 Live in the sunshine where you belong
2 coming soon
5 Little Climate & Environmental Activist | Foun...
6
rawDescription descriptionUrls \
0 Global Citizen, Actor, Director: Sky is the ro... []
1 Live in the sunshine where you belong []
2 coming soon []
5 Little Climate & Environmental Activist | Foun... []
6 []
verified created followersCount friendsCount ... \
0 False 2009-06-06T07:50:57+00:00 603 311 ...
1 False 2021-01-29T09:58:06+00:00 14 134 ...
2 False 2010-04-28T03:12:18+00:00 51 27 ...
5 False 2021-02-18T19:41:57+00:00 46 106 ...
6 False 2012-03-27T23:14:32+00:00 427 1005 ...
favouritesCount listedCount mediaCount location protected \
0 4269 23 1211 False
1 240 0 102 False
2 77 0 12 Brampton, On False
5 60 0 53 she/they False
6 18962 0 30 False
linkUrl linkTcourl \
0 https://www.cosmosmovieofficial.com https://t.co/3uaoV3gCt3
1 None None
2 None None
5 None None
6 None None
profileImageUrl \
0 https://pbs.twimg.com/profile_images/121554174...
1 https://pbs.twimg.com/profile_images/136417063...
2 https://pbs.twimg.com/profile_images/156497514...
5 https://pbs.twimg.com/profile_images/136428288...
6 https://pbs.twimg.com/profile_images/135582023...
profileBannerUrl \
0 https://pbs.twimg.com/profile_banners/45091142...
1 https://pbs.twimg.com/profile_banners/13550926...
2 None
5 https://pbs.twimg.com/profile_banners/13624874...
6 https://pbs.twimg.com/profile_banners/53863880...
url
0 https://twitter.com/ArjunSinghPanam
1 https://twitter.com/PrdeepNain
2 https://twitter.com/anmoldhaliwal
5 https://twitter.com/ShariaActivist
6 https://twitter.com/KaurDosanjh1979
[5 rows x 21 columns]
%% Cell type:code id:c30e443b tags:
``` python
# Rename some columns and drop duplicates
user_id = []
for user in raw_tweets['user']:
    uid = user['id']
    user_id.append(uid)
raw_tweets['userId'] = user_id
# Keep only the useful columns
cols = ['id','userId','date', 'renderedContent','likeCount']
tweets = raw_tweets[cols]
tweets.rename(columns={'id':'tweetId', 'url':'tweetUrl'}, inplace=True)
tweets = pd.DataFrame(tweets)
tweets.drop_duplicates(subset=['tweetId'], inplace=True)
tweets.head(5)
```
%% Output
C:\Users\ngoch\anaconda3\lib\site-packages\pandas\core\frame.py:5039: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().rename(
tweetId userId date \
0 1364506249291784198 45091142 2021-02-24 09:23:35+00:00
1 1364506237451313155 1355092620662329349 2021-02-24 09:23:32+00:00
3 1364506167226032128 137908912 2021-02-24 09:23:16+00:00
8 1364505991887347714 137908912 2021-02-24 09:22:34+00:00
11 1364505813834989568 137908912 2021-02-24 09:21:51+00:00
renderedContent likeCount
0 The world progresses while the Indian police a... 0
1 #FarmersProtest \n#ModiIgnoringFarmersDeaths \... 0
3 @ReallySwara @rohini_sgh watch full video here... 0
8 @mandeeppunia1 watch full video here youtu.be/... 0
11 @mandeeppunia1 watch full video here youtu.be/... 0
%% Cell type:code id:be2ad904 tags:
``` python
users = users[['userId','location']]
```
%% Output
userId location
0 45091142
1 1355092620662329349
2 137908912 Brampton, On
5 1362487487747121152 she/they
6 538638801
... ... ...
48409 714731566828793857 Mumbai, India
48413 1214383604752478208
48416 482737576 india
48417 57621609 Pune, India
48423 1355780354091720704
[12407 rows x 2 columns]
%% Cell type:code id:746551d9 tags:
``` python
# Final table: join tweets and users on their shared userId column
tweets = pd.merge(tweets, users, on='userId')
```
%% Cell type:code id:6366dee6 tags:
``` python
print(tweets)
```
%% Output
tweetId userId date \
0 1364506249291784198 45091142 2021-02-24 09:23:35+00:00
1 1364154888666644481 45091142 2021-02-23 10:07:24+00:00
2 1364111785486413828 45091142 2021-02-23 07:16:08+00:00
3 1364109014393626629 45091142 2021-02-23 07:05:07+00:00
4 1364000091208622082 45091142 2021-02-22 23:52:18+00:00
... ... ... ...
48424 1360040890438283264 714731566828793857 2021-02-12 01:39:51+00:00
48425 1360040704186064896 1214383604752478208 2021-02-12 01:39:06+00:00
48426 1360040623881940995 482737576 2021-02-12 01:38:47+00:00
48427 1360040616541884418 57621609 2021-02-12 01:38:45+00:00
48428 1360040267806605316 1355780354091720704 2021-02-12 01:37:22+00:00
renderedContent likeCount \
0 The world progresses while the Indian police a... 0
1 An honest piece to counteract the attempt to d... 0
2 Today hashtag for the #farmersprotest is \n\n#... 0
3 Humanity against tyranny. \n\n#FarmersProtest ... 1
4 After months of protests, over 200 deaths, 11 ... 1
... ... ...
48424 Farmers at the Singhu border have started stre... 7
48425 @sachin_rt youtu.be/smjfGmCn7x0\nHope we dont ... 0
48426 #farmersProtest getting mommentum from differ... 3
48427 if you have some interest in understanding #Fa... 4
48428 @BeingSalmanKhan Salmon Khan, wear bangles and... 0
location
0
1
2
3
4
... ...
48424 Mumbai, India
48425
48426 india
48427 Pune, India
48428
[48429 rows x 6 columns]
%% Cell type:code id:0888a459 tags:
``` python
#Create index
client.indices.create(index='farmersprotest')
```
%% Output
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'farmersprotest'})
%% Cell type:code id:db8686c0 tags:
``` python
# Inject the data with the bulk helper, one batch of total_number // n documents at a time
from elasticsearch import helpers
total_number = 48429
c = 0
actions = []
n = 100
while (c < total_number):
    actions = []
    j = 0
    while (j < total_number//n):
        action = {
            "_index": "farmersprotest",
            "_id": tweets['tweetId'].iloc[c],
            "userId":tweets['userId'].iloc[c],
            "_doc": tweets['renderedContent'].iloc[c],
            "date": tweets['date'].iloc[c],
            "location": tweets['location'].iloc[c],
            "like": tweets["likeCount"].iloc[c]
        }
        actions.append(action)
        j += 1
        c += 1
        if c == total_number:
            break
    helpers.bulk(client, actions)
```
%% Cell type:code id:f4213a8f tags:
``` python
client.search(index='farmersprotest',body={"query": {"match_all": {}}})
```
%% Output
C:\Users\ngoch\AppData\Local\Temp/ipykernel_26524/3406525718.py:1: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. Instead use individual parameters.
client.search(index='farmersprotest',body={"query": {"match_all": {}}})
ObjectApiResponse({'took': 824, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 10000, 'relation': 'gte'}, 'max_score': 1.0, 'hits': [{'_index': 'farmersprotest', '_id': '1364506249291784198', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': 'The world progresses while the Indian police and Govt are still trying to take India back to the horrific past through its tyranny. \n\n@narendramodi @DelhiPolice Shame on you. \n\n#ModiDontSellFarmers \n#FarmersProtest \n#FreeNodeepKaur twitter.com/ravisinghka/st…', 'date': '2021-02-24T09:23:35+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1364154888666644481', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'An honest piece to counteract the attempt to dehumanise the protesting Sikhs in Denmark and Europe. \n\n#farmersprotest twitter.com/sikharchive/st…', 'date': '2021-02-23T10:07:24+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1364111785486413828', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Today hashtag for the #farmersprotest is \n\n#Pagdi_Sambhal_Jatta twitter.com/kisanektamorch…', 'date': '2021-02-23T07:16:08+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1364109014393626629', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Humanity against tyranny. \n\n#FarmersProtest \n#humanrights twitter.com/ravisinghka/st…', 'date': '2021-02-23T07:05:07+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1364000091208622082', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': 'After months of protests, over 200 deaths, 11 meetings. \n\n@nstomar still hasn’t figured out the issue. \n\nSir take the farmer’s offer of an open public debate and you’ll be the most informed, then you’ll be first in line to get the laws repealed. 
\n\n#farmersprotest #freenodeepkaur https://t.co/lMyE2E1kLp', 'date': '2021-02-22T23:52:18+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1363997412424040448', '_score': 1.0, '_source': {'userId': 45091142, '_doc': '22/02/21 Updates from @jagjitvaheguru 🙏🏽\n\n#farmersprotest twitter.com/jagjitvaheguru…', 'date': '2021-02-22T23:41:39+00:00', 'location': '', 'like': 0}}, {'_index': 'farmersprotest', '_id': '1363760114969288708', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Humanity against tyranny. \n\n#FarmersProtest twitter.com/reutersindia/s…', 'date': '2021-02-22T07:58:43+00:00', 'location': '', 'like': 2}}, {'_index': 'farmersprotest', '_id': '1363753090181197824', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': 'Not just a few mislead people. \n\n@narendramodi These are the people who put you in the position of governing the country. \n\nThey didn’t put you there to be ignored, violently attacked, sexually abused or have their food and water cut off by your @DelhiPolice\n\n#farmersprotest twitter.com/reuters/status…', 'date': '2021-02-22T07:30:48+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1363613574317408266', '_score': 1.0, '_source': {'userId': 45091142, '_doc': 'Today’s update 21/02/21: \n\nThank you @jagjitvaheguru 🙏🏽\n\n#FarmersProtest \n#MSPLawForAllCrops \n#ReleaseNodeepKaur twitter.com/jagjitvaheguru…', 'date': '2021-02-21T22:16:25+00:00', 'location': '', 'like': 1}}, {'_index': 'farmersprotest', '_id': '1363029589694574594', '_score': 1.0, '_ignored': ['_doc.keyword'], '_source': {'userId': 45091142, '_doc': '@MankamalSingh @Punjab2000music @SikhPA @minkaur5 @SCUKofficial @SikhFedUK @sikhsinscotland @DrJasjitSingh Their response completely ignores the fact that Geeta cuts the speaker off and says, \n“it’s not propaganda.”\n\nIn response to @DrSharandeep stating that the govt of Indian is spreading propaganda regarding the farmer’s protests being 
hijacked. \n\n@BBCNews \n\n#FarmersProtest https://t.co/Tp5A3ajVuE', 'date': '2021-02-20T07:35:52+00:00', 'location': '', 'like': 3}}]}})
%% Cell type:code id:e2b8835e tags:
``` python
```
......@@ -240,16 +240,59 @@ Résultat attendu
#### Q9. Compute the proportion of people whose balance is strictly less than 1000
(not sure about this question)
## Part 2: Elasticsearch API
In this part, we start building a Twitter analysis tool with ES. There are two options:
## Part 2: Elasticsearch Client in Python
1. Use the dataset "", which contains the tweets with the hashtag "#FarmersProtest"
2. Install Logstash and create a Twitter developer account; we will use the tokens and keys generated by Twitter to extract the tweets containing the hashtag "Grève" over the period xxxxxxx
### 2.1: Discover ES in Python
In this part, we explore how to interact with Elasticsearch from Python. You may use any IDE you like.
**Note**: besides Elasticsearch, we also need the pandas library from here on.
#### Q1. Install Elasticsearch and import it in Python
### 2.2: Discover Logstash
```bash
docker run --rm -it -v ./twitter_pipeline.conf/:/usr/share/logstash/pipeline/ docker.elastic.co/logstash/logstash:8.6.0
```
```bash
pip3 install elasticsearch
```
In your Python script:
``` python
import elasticsearch
from elasticsearch import Elasticsearch
# Check the elasticsearch version:
print(elasticsearch.VERSION)
```
This guide uses version 8.6.0.
#### Q2. Connect to the Elasticsearch server
``` python
ELASTIC_PASSWORD = "<your password>"
client = Elasticsearch(
    "https://localhost:9200",
    ca_certs="path/to/http_ca.crt",
    basic_auth=("elastic", ELASTIC_PASSWORD))
# Check that everything works by running:
client.info()
```
The response should look like this:
![attendu2](/images/connectapipython.PNG)
You can also query the data created in Part 1:
``` python
client.search(index='customer', query={"match_all": {}})
```
#### Q3. Process the data
For the rest of the lab, we provide a dataset of tweets carrying the hashtag #Farmersprotest. In this question, we clean the data before injecting it into Elasticsearch.
#### 3.1. Read the data in JSON format
Hint: use pandas.read_json('/path/to/the/data', lines=True)
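The hint above can be tried on a tiny in-memory sample before loading the real file. This is a minimal sketch: the two records and the name `demo_tweets` are made up for illustration and only mimic a few of the fields of the dataset.

```python
import io

import pandas as pd

# Two hypothetical tweets in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"id": 1, "lang": "en", "renderedContent": "hello"}\n'
    '{"id": 2, "lang": "fr", "renderedContent": "bonjour"}\n'
)

# lines=True makes pandas parse each line as a separate record.
demo_tweets = pd.read_json(sample, lines=True)
print(demo_tweets.shape)
```

Without `lines=True`, pandas would expect the whole input to be a single JSON document, which is not how a file with one object per line is laid out.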
#### 3.2. Filter
We only use tweets in English and we do not want duplicates. In addition, only the columns id, date, user and renderedContent are needed for what follows. Rename the field id -> tweetID to tell it apart. The resulting table is called raw_tweets
Call head() to inspect the result
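One possible way to chain these filtering steps is sketched below on a small made-up DataFrame. The sample values and the name `raw_tweets_demo` are assumptions for illustration; only the column names come from the text above.

```python
import pandas as pd

# Hypothetical sample: the first two rows are the same tweet, the third is not in English.
demo = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "lang": ["en", "en", "fr", "en"],
    "date": ["2021-02-24", "2021-02-24", "2021-02-23", "2021-02-22"],
    "user": [{"id": 10}, {"id": 10}, {"id": 11}, {"id": 12}],
    "renderedContent": ["a", "a", "b", "c"],
    "url": ["u1", "u1", "u2", "u3"],
})

# Keep English tweets, select the useful columns, rename id and drop duplicate tweets.
english = demo[demo["lang"] == "en"]
raw_tweets_demo = (
    english[["id", "date", "user", "renderedContent"]]
    .rename(columns={"id": "tweetID"})
    .drop_duplicates(subset=["tweetID"])
)
print(raw_tweets_demo.head())
```

Passing `subset=["tweetID"]` matters here: the nested `user` column holds dictionaries, which pandas cannot hash, so deduplicating on all columns would fail.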
#### 3.3. Flatten the nested field
The "user" field is nested; build a table from it. Remove the duplicates. Keep only the columns id and location. Rename the field id -> userID to tell it apart. The resulting table is called users
Hint: use pandas.json_normalize; it must be imported before use
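The flattening step might look like the sketch below. The sample user records and the name `users_demo` are hypothetical; the real "user" field contains many more keys.

```python
import pandas as pd
from pandas import json_normalize

# Hypothetical nested "user" column: one dictionary per tweet, with a repeated author.
user_column = pd.Series([
    {"id": 10, "username": "alice", "location": "Pune, India"},
    {"id": 10, "username": "alice", "location": "Pune, India"},
    {"id": 11, "username": "bob", "location": ""},
])

# json_normalize turns each dictionary key into its own column.
users_demo = (
    json_normalize(user_column.tolist())
    .rename(columns={"id": "userID"})
    .drop_duplicates(subset=["userID"])
    .loc[:, ["userID", "location"]]
)
print(users_demo)
```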
#### 3.4. Create the final table
The final table, called "tweets", is created by merging the two previous tables.
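A minimal sketch of the merge, assuming the tweets table already carries a `userID` column extracted from the nested "user" field (that extraction step is not described above, and the sample values are made up):

```python
import pandas as pd

# Hypothetical inputs: three tweets written by two users.
tweets_demo = pd.DataFrame({"tweetID": [1, 2, 3], "userID": [10, 10, 11]})
users_demo = pd.DataFrame({"userID": [10, 11], "location": ["Pune, India", ""]})

# Naming the join key explicitly avoids surprises if the tables ever share other columns.
merged = pd.merge(tweets_demo, users_demo, on="userID", how="left")
print(merged)
```

`how="left"` keeps every tweet even when its author is missing from the users table; the default inner join would silently drop such rows.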
#### Q4. Inject the data into Elasticsearch
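One way to prepare the injection is to build the bulk actions with a generator, as sketched below. The function `generate_actions`, the sample records and the document field names (`content` in particular) are assumptions for illustration; the sketch only builds the actions and does not contact a cluster.

```python
from typing import Dict, Iterable, Iterator


def generate_actions(records: Iterable[Dict], index_name: str) -> Iterator[Dict]:
    """Yield one bulk action per record, using the tweet id as the document id."""
    for record in records:
        yield {
            "_index": index_name,
            "_id": record["tweetID"],
            "_source": {
                "userID": record["userID"],
                "content": record["renderedContent"],
                "date": record["date"],
                "location": record["location"],
            },
        }


# Hypothetical rows, shaped like the final "tweets" table.
demo_records = [
    {"tweetID": 1, "userID": 10, "renderedContent": "a",
     "date": "2021-02-24T09:23:35+00:00", "location": "Pune, India"},
    {"tweetID": 2, "userID": 11, "renderedContent": "b",
     "date": "2021-02-23T10:07:24+00:00", "location": ""},
]

actions = list(generate_actions(demo_records, "farmersprotest"))
print(len(actions))

# With a live cluster, the generator could be handed directly to the bulk helper:
# from elasticsearch import helpers
# helpers.bulk(client, generate_actions(tweets.to_dict("records"), "farmersprotest"))
```

Because `helpers.bulk` consumes any iterable and splits it into chunks itself, a generator avoids holding every action in memory and removes the need for manual batching loops.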
## Part 3: Data aggregation and visualization in Kibana
images/connectapipython.PNG

14.2 KiB