During my most recent conversation with ChatGPT, I made an amazing discovery: the AI language model is able to interpret emoticons and generate stories based on them.
I experimented by sending a number of emoticons to ChatGPT, and the program turned them into a hilarious story that kept me on the edge of my seat the whole time. The creativity and inventiveness displayed by ChatGPT astounded me.
It is fascinating to watch ChatGPT turn something as simple as an emoji into something as complex and engaging as a story. If you haven’t tried it yet, I strongly advise you to do so. New creative ways of interacting with the model are discovered every day.
The widespread adoption of AI promises to revolutionize not only the business world, but also our everyday life. As an illustration, consider the rise of conversational agents like GPT (Generative Pre-trained Transformer), which can hold natural-sounding conversations and carry out tasks like customer service and data collection.
There may be major shifts in the job market as a result of the widespread use of chatbots and other forms of AI technology. As AI and automation increasingly replace people, job markets may change and previously unneeded skills may become necessary. On the other hand, AI has the potential to enhance productivity and generate new employment opportunities.
The geopolitical ramifications of AI are significant as well. As AI becomes more commonplace, governments that are able to develop and deploy these technologies more quickly may gain an advantage in terms of economic growth and military capability. The potential for the misuse of AI in military or espionage contexts is a serious concern, and could lead to an AI “arms race” between governments.
Adopting AI may also have far-reaching effects on society. There is concern that industries like customer service, which rely heavily on human connection, may decline with the rising use of automation and machine learning algorithms. Social harmony and interpersonal relationships may suffer as a consequence.
The overall economic and social effects of AI are intricate and varied. In order to avoid unintended repercussions, governments, organizations, and individuals must responsibly create and utilize AI.
I selected this dataset because of the large number of variables to explore and use in my model, and because the data seemed very clean. I also liked the challenge of predicting a multi-categorical target feature. The competition has been around since 2015, so I was able to find a good number of posts in its “Discussion” section to learn from and improve.
The seven levels of the target feature are described below; in the dataset, however, they are expressed as integers in the last column, “Cover_Type”:
Spruce/Fir
Lodgepole Pine
Ponderosa Pine
Cottonwood/Willow
Aspen
Douglas-fir
Krummholz
The “train” set of the competition includes 15,120 observations, while the “test” set contains 565,892 observations. The descriptive features are, in order:
Elevation – Elevation in meters
Aspect – Aspect in degrees azimuth
Slope – Slope in degrees
Horizontal_Distance_To_Hydrology – Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology – Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways – Horz Dist to nearest roadway
Hillshade_9am (0 to 255 index) – Hillshade index at 9am, summer solstice
Hillshade_Noon (0 to 255 index) – Hillshade index at noon, summer solstice
Hillshade_3pm (0 to 255 index) – Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points – Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) – Wilderness area designation
Soil_Type (40 binary columns, 0 = absence or 1 = presence) – Soil Type designation
Cover_Type (7 types, integers 1 to 7) – Forest Cover Type designation
Data exploration
One of the first elements I checked through visualisation was the balance of the dataset. I discovered that the dataset was perfectly balanced: the 7 levels of the target feature have the same number of observations across the “train” dataset.
Plotting the histograms of all continuous features, I saw that none of them followed a normal distribution.
With the help of visualisation, I also discovered interesting relationships between “Aspect”, the three “Hillshade” columns, and the columns describing horizontal distances.
There were no missing values in the dataset; however, I was concerned by some continuous features having a large number of zeros. I wrote a check that identified those features and then read the dataset description on the competition’s Kaggle page. I concluded that those zeros were legitimate and had to be left as they were (for example, in “Slope”, “Aspect”, and the distances from hydrology or roadways): rather than errors or missing values, they were correct values.
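A minimal sketch of this kind of check, assuming the training data is loaded into the pandas DataFrame “trees” used in the full code at the end of the article:

# Count the zeros in each of the 10 continuous columns and show only the affected features
zero_counts = (trees.iloc[:, :10] == 0).sum()
print(zero_counts[zero_counts > 0])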
As I did not encounter normal distributions, I decided not to investigate outliers further, as their presence could be beneficial to my model. The histograms of the continuous features also did not show any concerning outliers.
The only two categorical features of the dataset, “Soil_Type” and “Wilderness_Area”, were already one-hot encoded by the dataset provider into binary columns. Although this type of encoding is well suited to the KNN algorithm, it aggravates the issue of dimensionality, especially for the “Soil_Type” category (consisting of 40 levels). For this reason, I decided to reverse this encoding by creating two new columns, “Soil_Type_All” and “Wilderness_Area_All”; I would test the two types of encoding in my iterations to assess what works best.
# I reverse the one-hot encoding of Soil_Type and Wilderness_Area, creating two new columns
soil_start = trees.columns.get_loc("Soil_Type1")
soil_end = trees.columns.get_loc("Soil_Type40")
trees.insert(soil_end+1, 'Soil_Type_All', trees.iloc[:, soil_start:soil_end+1].idxmax(axis=1))
area_start = trees.columns.get_loc("Wilderness_Area1")
area_end = trees.columns.get_loc("Wilderness_Area4")
trees.insert(area_end+1, 'Wilderness_Area_All', trees.iloc[:, area_start:area_end+1].idxmax(axis=1))
The model: K-nearest neighbour
Because of the high number of continuous features in the dataset, I decided to go with similarity-based learning, in particular the K-nearest neighbours algorithm.
As a first step, I normalised the continuous features, as KNN is very sensitive to unnormalised data.
I transformed the values in the newly created categorical columns into integers, so that they could be used by my model. I was conscious that this is not ideal, as it implies an order between the levels of the categories; however, I still thought it was better than training the model on too many features.
Before running the first iterations, for local evaluation purposes, I split the “train.csv” dataset into train (80%) and test (20%) sets. In order to maintain the target feature balance, I used the “stratify” parameter of the “train_test_split” function.
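Condensed into a short sketch (the full version appears in the code at the end of the article), these two steps look roughly like this:

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Normalise the 10 continuous features, then create a stratified 80/20 split
min_max_scaler = preprocessing.MinMaxScaler()
trees.iloc[:, :10] = min_max_scaler.fit_transform(trees.iloc[:, :10])
X_train, X_test, y_train, y_test = train_test_split(
    trees.iloc[:, :10], trees['Cover_Type'],
    stratify=trees['Cover_Type'], test_size=0.20, random_state=1)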
Model iterations
Since the dataset is perfectly balanced, I knew I could try high “k” values. My first iteration (“M1”) involved trying different values for “k”. I iterated through 3, 5, 6 and 8, and discovered that the best value of “k” was 5, which gave me an accuracy of 76%.
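The full code at the end repeats one classifier per value of “k”; condensed into a loop (and assuming the scikit-learn imports from that code), the idea is simply:

# Compare a few values of k on the local hold-out set
for k in [3, 5, 6, 8]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test)))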
My second iteration involved trying a distance metric other than Euclidean: Manhattan, which is less influenced by single large differences in individual features. The model accuracy was 74%, lower than before. For this reason, I decided to use the Euclidean distance (the default) for my following iterations.
For my third iteration, I decided to reduce the number of features on which to train the model, with the help of the scikit-learn function “SelectKBest”. The function selected the top 8 features (out of 10) for me; the accuracy of the third model (“M3”) was still just above 76%.
# I use the SelectKBest function to select the 8 most useful features for my model
selector = SelectKBest(chi2, k=8)
selector.fit(X_train, y_train)
cols = selector.get_support(indices=True)
X_train_new = X_train.iloc[:, cols]
sel_col = []
for col in X_train_new:
    sel_col.append(X_test.columns.get_loc(X_train_new[col].name))

# KNN classifier built with the previously selected features only
mod3 = KNeighborsClassifier(n_neighbors=5)
mod3.fit(X_train_new, y_train)
y_pred = mod3.predict(X_test.iloc[:, sel_col])
print("M3 Accuracy score: " + str(accuracy_score(y_test, y_pred)))
Before proceeding with the 4th iteration, I tested the difference between feeding the model the one-hot encoded “Wilderness_Area” features (4 binary columns) and using the created “Wilderness_Area_All” column, with integer encoding. The accuracy of the two KNN classifiers was exactly the same.
For my following iteration, I included the “Wilderness_Area_All” and “Soil_Type_All” features in the model training. This greatly improved my model (“M4”), which scored an accuracy of 80%.
My 5th iteration involved changing the number of top features selected from 8 to 6. This proved to be successful, with an accuracy score for “M5” of 83%.
In my 6th iteration, I ran the same model but with a distance-weighted KNN algorithm, by adding the “weights” parameter. The reasoning was that closer neighbours should carry more weight in the prediction. “M6” was the most successful model found, scoring an accuracy of 85%.
# I test a distance weighted KNN
mod6 = KNeighborsClassifier(n_neighbors=5,weights='distance')
mod6.fit(X_train_new, y_train)
y_pred = mod6.predict(X_test.iloc[:, sel_col])
print("M6 Accuracy score: " + str(accuracy_score(y_test, y_pred)))
After checking single-class accuracy, I noticed that for three classes I achieved an accuracy of over 95%, while for one my accuracy was only 67%. This is an interesting point for future investigations.
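The per-class figures come from the diagonal of the row-normalised confusion matrix, roughly as follows (using the imports from the full code at the end):

# Per-class accuracy: normalise each row of the confusion matrix and read its diagonal
cm = confusion_matrix(y_test, y_pred)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print(cm.diagonal())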
For my last iteration, I engineered a new feature, “Hillshade_Avg”, created by averaging the three features “Hillshade_9am”, “Hillshade_Noon” and “Hillshade_3pm”. I then trained my model with it instead of the original three features, but the accuracy of “M7” was exactly the same as “M6”.
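In pandas this is a one-liner; a sketch equivalent to the insert-based version in the full code:

# Average the three hillshade readings into a single engineered feature
trees['Hillshade_Avg'] = trees[['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']].mean(axis=1)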
The best performing KNN model found, at the end of all iterations, was “M6”.
Kaggle performance report
I ran my model on the Kaggle competition “test” dataset, and submitted my predictions. My score was 0.68217. This result was slightly disappointing, given the high accuracy achieved in my local evaluation of the model.
Future improvements
With more time at my disposal, I would improve my model in the following ways:
Using 10-fold cross-validation in my local evaluation strategy (see the sketch after this list).
Deeper data exploration, by:
better investigating feature relationships and correlations;
investigating outliers, even though the distributions of the continuous features are not normal.
Feature engineering: creating new features that describe the relationships between the original features, and using them to train my model.
Investigating the classes with low prediction accuracy, by looking at the model accuracy class by class.
Testing a decision tree algorithm, after binning the continuous descriptive features.
Grouping the 40 different “Soil_Types” into fewer categories.
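As an illustration of the first point, a 10-fold cross-validation of the best model could be sketched with scikit-learn’s cross_val_score (a sketch only, not something I ran for this article):

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation of the distance-weighted KNN on the selected features
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
scores = cross_val_score(knn, X_train_new, y_train, cv=10)
print(scores.mean(), scores.std())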
Full code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import confusion_matrix
# I import the training data, and check its main characteristics
trees = pd.read_csv("train.csv")
print(trees.shape)
trees = trees.set_index('Id')
print(trees.shape)
trees.dtypes
print(trees.describe())

# Check for missing values
trees.isnull().any()

# Check the minimum and maximum of each feature
for column in trees:
    print(column + ': ' + str(trees[column].min()) + ' - ' + str(trees[column].max()))

# I reverse the one-hot encoding of Soil_Type and Wilderness_Area, creating two new columns
soil_start = trees.columns.get_loc("Soil_Type1")
soil_end = trees.columns.get_loc("Soil_Type40")
trees.insert(soil_end+1, 'Soil_Type_All', trees.iloc[:, soil_start:soil_end+1].idxmax(axis=1))
area_start = trees.columns.get_loc("Wilderness_Area1")
area_end = trees.columns.get_loc("Wilderness_Area4")
trees.insert(area_end+1, 'Wilderness_Area_All', trees.iloc[:, area_start:area_end+1].idxmax(axis=1))

# Visualise the target feature and the categorical feature Wilderness_Area
fig, ax = plt.subplots(1,2, figsize=(17,5))
ax[0].bar(trees['Cover_Type'].unique(), trees['Cover_Type'].value_counts())
ax[1].bar(trees['Wilderness_Area_All'].unique(), trees['Wilderness_Area_All'].value_counts())
plt.show()

# Visualise all continuous descriptive features
fig, ax = plt.subplots(2,4,figsize=(17,6))
ax[0,0].hist(trees.iloc[:,0])
ax[0,0].set_title(trees.columns[0])
ax[0,1].hist(trees.iloc[:,1])
ax[0,1].set_title(trees.columns[1])
ax[0,2].hist(trees.iloc[:,2])
ax[0,2].set_title(trees.columns[2])
ax[0,3].hist(trees.iloc[:,3])
ax[0,3].set_title(trees.columns[3])
ax[1,0].hist(trees.iloc[:,4])
ax[1,0].set_title(trees.columns[4])
ax[1,1].hist(trees.iloc[:,5])
ax[1,1].set_title(trees.columns[5])
ax[1,2].hist(trees.iloc[:,9])
ax[1,2].set_title(trees.columns[9])
ax[1,3].axis('off')
plt.tight_layout(h_pad=3)
plt.show()

# Visualise in detail Elevation and Aspect
fig, ax = plt.subplots(1,2,figsize=(20,8))
ax[0].hist(trees.iloc[:,0], bins=30)
ax[0].set_title(trees.columns[0])
ax[1].hist(trees.iloc[:,1], bins=30)
ax[1].set_title(trees.columns[1])
plt.show()

# Huge pairwise relationships plot of all continuous values; it may take time to load
sns.set()
# sns_plot = sns.pairplot(trees.iloc[:,0:10])
# sns_plot.savefig("pairplot.png")

# Detail of interesting relationships, with the target feature plotted as colour
sns.pairplot(trees.iloc[:,[0,1,56]], hue='Cover_Type');
sns.pairplot(trees.iloc[:,[6,7,8,56]], hue='Cover_Type');

# Histograms of the main continuous features with the Cover_Type dimension on colour
trees.pivot(columns="Cover_Type", values="Aspect").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Slope").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Elevation").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Hillshade_Noon").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Horizontal_Distance_To_Hydrology").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Vertical_Distance_To_Hydrology").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Horizontal_Distance_To_Roadways").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Horizontal_Distance_To_Fire_Points").plot.hist(bins=50)
plt.show()

# Boxplots of the main continuous features by Cover_Type
plt.clf()
sns.boxplot(x="Cover_Type", y="Slope", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Elevation", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Aspect", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Hillshade_Noon", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Horizontal_Distance_To_Hydrology", data=trees)
plt.show()
plt.clf()

# Normalisation of continuous features
min_max_scaler = preprocessing.MinMaxScaler()
trees.iloc[:,:10] = min_max_scaler.fit_transform(trees.iloc[:,:10])

# I transform the categorical values into integers
trees['Wilderness_Area_All'] = trees['Wilderness_Area_All'].str.replace('Wilderness_Area', '').astype(int)
trees['Soil_Type_All'] = trees['Soil_Type_All'].str.replace('Soil_Type', '').astype(int)

# I select the first 10 features, and split the dataset into training and test sets, for local evaluation purposes
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,:10], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)

# I test various k values
# First KNN classifier with K = 3, Euclidean distance (default)
mod1 = KNeighborsClassifier(n_neighbors=3)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K3 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# KNN classifier with K = 5
mod1 = KNeighborsClassifier(n_neighbors=5)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K5 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# First KNN classifier with K = 6
mod1 = KNeighborsClassifier(n_neighbors=6)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K6 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# First KNN classifier with K = 8
mod1 = KNeighborsClassifier(n_neighbors=8)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K8 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# I test a different type of distance metric# KNN classifier with K = 5, manhattan distance
mod2 = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
mod2.fit(X_train, y_train)
y_pred = mod2.predict(X_test)print("M2 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# I use the SelectKBest function to select the most 8 useful features for my model
selector = SelectKBest(chi2, k=8)
selector.fit(X_train, y_train)
cols = selector.get_support(indices=True)
X_train_new = X_train.iloc[:,cols]
sel_col = []
for col in X_train_new:
    sel_col.append(X_test.columns.get_loc(X_train_new[col].name))

# KNN classifier built with the previously selected features only
mod3 = KNeighborsClassifier(n_neighbors=5)
mod3.fit(X_train_new, y_train)
y_pred = mod3.predict(X_test.iloc[:, sel_col])
print("M3 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# Test to assess the impact of the two different encodings of the categorical values
# KNN with "Wilderness_Area" encoded as integers in a single column
print(trees.columns.get_loc("Wilderness_Area_All"))
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,14]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
modt = KNeighborsClassifier(n_neighbors=5)
modt.fit(X_train, y_train)
y_pred = modt.predict(X_test)
print("Test with 'integer' encoding accuracy score: " + str(accuracy_score(y_test, y_pred)))

# KNN with "Wilderness_Area" one-hot encoded, as it was originally in the dataset
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,10,11,12,13]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
modt2 = KNeighborsClassifier(n_neighbors=5)
modt2.fit(X_train, y_train)
y_pred = modt2.predict(X_test)
print("Test with 'one-hot' encoding accuracy score: " + str(accuracy_score(y_test, y_pred)))

# For the next classifier, I include the "Wilderness_Area_All" and "Soil_Type_All" features
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,5,6,7,8,9,14,55]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
mod4 = KNeighborsClassifier(n_neighbors=5)
mod4.fit(X_train, y_train)
y_pred = mod4.predict(X_test)
print("M4 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# I repeat the process of automatic feature selection, but this time I select 6 instead of 8 top features
selector = SelectKBest(chi2, k=6)
selector.fit(X_train, y_train)
cols = selector.get_support(indices=True)
X_train_new = X_train.iloc[:,cols]
sel_col = []
for col in X_train_new:
    sel_col.append(X_test.columns.get_loc(X_train_new[col].name))
mod5 = KNeighborsClassifier(n_neighbors=5)
mod5.fit(X_train_new, y_train)
y_pred = mod5.predict(X_test.iloc[:, sel_col])
print("M5 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# I test a distance-weighted KNN
mod6 = KNeighborsClassifier(n_neighbors=5,weights='distance')
mod6.fit(X_train_new, y_train)
y_pred = mod6.predict(X_test.iloc[:, sel_col])
print("M6 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# Check single-class accuracy
cm = confusion_matrix(y_test, y_pred)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print(cm.diagonal())

# I create a new "Hillshade_Avg" column, by averaging "Hillshade_9am", "Hillshade_Noon" and "Hillshade_3pm"
shade_start = trees.columns.get_loc("Hillshade_9am")
shade_end = trees.columns.get_loc("Hillshade_3pm")
trees.insert(shade_end+1, 'Hillshade_Avg', (trees.iloc[:, shade_start] + trees.iloc[:, shade_start+1] + trees.iloc[:, shade_end]) / 3)

# KNN classifier that includes the newly created feature
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,5,9,10,15,56]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
selector2 = SelectKBest(chi2, k=6)
selector2.fit(X_train, y_train)
cols2 = selector2.get_support(indices=True)
X_train_new2 = X_train.iloc[:,cols2]
sel_col2 = []
for col in X_train_new2:
    sel_col2.append(X_test.columns.get_loc(X_train_new2[col].name))
mod7 = KNeighborsClassifier(n_neighbors=5,weights='distance')
mod7.fit(X_train_new2, y_train)
y_pred = mod7.predict(X_test.iloc[:, sel_col2])
print("M7 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# Preparation of the Kaggle test dataset: I apply the same transformations made to the train dataset
kaggle_test = pd.read_csv("test.csv")
kaggle_test = kaggle_test.set_index('Id')
soil_start = kaggle_test.columns.get_loc("Soil_Type1")
soil_end = kaggle_test.columns.get_loc("Soil_Type40")
kaggle_test.insert(soil_end+1, 'Soil_Type_All', kaggle_test.iloc[:, soil_start:soil_end+1].idxmax(axis=1))
area_start = kaggle_test.columns.get_loc("Wilderness_Area1")
area_end = kaggle_test.columns.get_loc("Wilderness_Area4")
kaggle_test.insert(area_end+1, 'Wilderness_Area_All', kaggle_test.iloc[:, area_start:area_end+1].idxmax(axis=1))
# Apply the scaler fitted on the training data (transform only, no re-fitting)
kaggle_test.iloc[:,:10] = min_max_scaler.transform(kaggle_test.iloc[:,:10])
kaggle_test['Wilderness_Area_All']= kaggle_test['Wilderness_Area_All'].str.replace('Wilderness_Area','').astype(int)
kaggle_test['Soil_Type_All'] = kaggle_test['Soil_Type_All'].str.replace('Soil_Type', '').astype(int)

# Dataset split and target feature prediction with the best model: "M6"
kaggle_test_sel = kaggle_test.iloc[:,[0,1,2,3,4,5,6,7,8,9,14,55]]
kaggle_pred = mod6.predict(kaggle_test_sel.iloc[:, sel_col])

# Export to CSV for submission
submission = {'Id': kaggle_test.index, 'Cover_Type': kaggle_pred}
subm = pd.DataFrame(submission, columns=['Id', 'Cover_Type'])
subm.to_csv(r'xxx.csv', index=False)
Can you imagine if you could sponsor the published content of any company page on LinkedIn? Well… you can.
I discovered this by chance, while exploring the security features of the LinkedIn Ads platform. As a first warning sign, I found that I could create an ads account connected to any company page on the platform. For my test, I created a new ads account connected to the Google LinkedIn company page.
Not a big deal, I thought, as connecting the page would not necessarily mean that I would be automatically authorized to advertise on its behalf. I proceeded by filling out the details of the campaigns. When I reached the ad selection page, I couldn’t believe my eyes.
The option “Create new ad” was greyed out, however I could select the ad creative from a list of already published posts by the Google LinkedIn company page. I selected this post from the list.
I launched the campaign, and it worked! I was now advertising on behalf of Google, sponsoring one of their posts. The campaign started accruing some impressions and clicks, at which point I stopped it.
In order to test that this was not an isolated case, I replicated the test by creating a new account connected, this time, to the Prada Group company page. For this experiment I selected a “seasoned” post (2 years of age) from their published content list. And yet again, the campaign started smoothly.
It can be argued that there is no real issue here, as it would be impossible to damage a company by sponsoring its own content. While this is partly true, it is also the case that seeing very old content could confuse users. Imagine, in particular, a situation where a company changes its standpoint on a certain matter over time, and a malicious actor sponsors one of its old posts, in which the company stated the opposite of its most recent viewpoint. I can’t see how this wouldn’t cause potential havoc.
I immediately notified LinkedIn of this and I am currently waiting for an answer.
Have any of you ever noticed the same?
Do you think this happens by design, or did I just find a bug?
Could you think of other possible malicious exploitation scenarios?
Google Trends is a great tool for checking relative search-volume trends for keywords or topics, and for comparing them with each other. The tool allows you to download the main trend graph points as a CSV file.
Some time ago I was practicing calculating linear regressions in Excel on 2-year sets of Google Trends data. I chose a time range of 2 years to account for possible seasonality. I wondered if I could automate this process with a script and output a single metric (the slope value) as an indication of the search trend over the past 2 years. A negative number means that searches are decreasing, while a positive number means that the trend is growing.
After a few searches I became aware of Scipy, a Python open-source library for scientific and technical computing that includes mathematical functions, including regressions.
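As a toy illustration of the core calculation, with made-up values, the slope of a linear regression over weekly data points is a one-liner with Scipy:

from scipy.stats import linregress

weeks = list(range(1, 11))                          # e.g. 10 weeks of data
values = [40, 42, 45, 43, 48, 50, 49, 53, 55, 58]   # made-up relative search volumes
print(linregress(weeks, values).slope)              # a positive slope indicates an upward trend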
I started this project by analyzing how Google Trends constructs its web requests in the browser, using the Google Chrome “Network” tool, and then I transferred what I learned into a Scrapy script (Scrapy being my favorite scraping library).
The result is the script shared with you at the end of this article; it can be called server-side with the following command:
scrapy runspider -L WARNING filename.py
In order for the script to work, both the Scrapy and Scipy libraries must be installed on your server and imported at the beginning of the script, along with the “datetime” and “json” modules. The code is ready to be used; just make sure to substitute the “keyword” value with the term you want to query trends for. The default data-point resolution is weekly, and the slope is calculated by pairing the numbers from 1 to 104 (the number of weeks) with the value extracted for each week.
I hope you will make good use of the script for your purposes, and please don’t hesitate to ask me if you have any doubts about its functions.
import scrapy
import json
from datetime import date, timedelta
from scipy.stats import linregress
# Calculate the current date and the same day 2 years earlier, in the format YYYY-MM-DD
mo = date.today().strftime('%m')
da = date.today().strftime('%d')
ye = date.today().year
old = str(ye - 2) + '-' + mo + '-' + da
today = str(ye) + '-' + mo + '-' + da
# We will query 2 years of data, therefore 104 weeks
# We will query 2 years of data, therefore 104 weeks
weeks = list(range(1, 105))

# Customize this with the keyword you want to query trends for
keyword = 'bitcoin'

# URL constructors
url1 = 'https://trends.google.com/trends/api/explore?hl=en-US&tz=-60&req={"comparisonItem":[{"keyword":"'
url2 = '","geo":"IE","time":"' + old + ' ' + today + '"}],"category":0,"property":""}&tz=-60'
url3 = ('https://trends.google.com/trends/api/widgetdata/multiline?hl=en-US&tz=-60&req={"time":"' + old + ' ' + today + '","resolution":"WEEK","locale":"en-US","comparisonItem":[{"geo":{"country":"IE"},"complexKeywordsRestriction":' + '{"keyword":[{"type":"BROAD","value":"')
url4 = '"}]}}],"requestOptions":{"property":"","backend":"IZG","category":0}}&token='
url5 = '&tz=-60'


# Scrapy scraper
class TrendsSpider(scrapy.Spider):
    name = "trends2"
    handle_httpstatus_list = [429, 500]

    def start_requests(self):
        # First we crawl the Google Trends root domain, in order to receive a valid session cookie
        url = 'https://trends.google.com/'
        yield scrapy.Request(url, callback=self.parse, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'})

    def parse(self, response):
        # The request that will return a JSON file with the tokens for each widget
        newurl = url1 + keyword + url2
        return scrapy.Request(newurl, callback=self.parse_1)

    def parse_1(self, response):
        # We have to delete the first characters of the response, as they are not valid JSON
        jsonresponse = json.loads(response.text[5:])
        # We parse the JSON response, capture the token for the "Interest over time" widget and construct the new URL with it
        kwd = jsonresponse['widgets'][0]['request']['comparisonItem'][0]['complexKeywordsRestriction']['keyword'][0]['value']
        tok = jsonresponse['widgets'][0]['token']
        finalurl = url3 + kwd + url4 + tok + url5
        # We pass the keyword to the next parser function
        return scrapy.Request(finalurl, callback=self.parse_2, meta={'keyword': kwd})

    def parse_2(self, response):
        jsonresponse2 = json.loads(response.text[6:])
        # We insert the weekly values in an array
        values = []
        for i in jsonresponse2['default']['timelineData']:
            values.extend(i['value'])
        if len(values) == 0:
            slope = "0.00"
        else:
            # linregress returns the slope of the linear regression fitted on the week numbers (1-104) paired with the values
            slope = "%.2f" % linregress(weeks, values).slope
        # Print the keyword with the slope value
        print(response.meta['keyword'] + ': ' + slope)
These days, if you want to check what’s on at the cinema in your city, your only option is to sort through the online listings by movie or by cinema. My girlfriend told me that newspapers, years ago, used to list movies by hour, and that this sorting method was extremely handy to consult on the same day.
Since at the time I was practicing web scraping with Scrapy, I decided to give it a go, fill this gap in the online cinema listings space and create a website that she and other people could use, with the look & feel of an old newspaper.
I created Finema by writing a very simple page of code that combines PHP, HTML and JavaScript. The PHP code queries and fetches in real time a JSON file on my personal server containing the cinema listings data. That JSON file is generated by a Scrapy script I developed, which captures the daily movie timings from a popular Irish online entertainment website every few hours.
All movies of the selected cinemas are listed by hour, starting from the first movie of the day and finishing with the last one. By clicking on a cinema name, it is possible to visualize the daily shows in that particular cinema only, while the drop-down allows you to select a specific movie. The two filters can be combined.
The risk with this type of aggregator lies in the source of the scraped data: if the website from which the data is taken changes its HTML code, the scraping script can break and return corrupted data, or no data at all. This is why it is important to put in place regular automated checks and notifications, so that you can amend your scripts as soon as something breaks.
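What such a check looks like depends on your setup; as a rough sketch (the file name and field names below are only placeholders, not part of my actual script), you could validate the freshly scraped JSON before publishing it:

import json

# Refuse to publish a scraped listings file that looks broken
with open('listings.json') as f:          # hypothetical output file of the scraping script
    listings = json.load(f)

if not listings or any('title' not in show or 'time' not in show for show in listings):
    raise SystemExit('Scraper output looks corrupted: keep the previous file and send an alert')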
Also, if you are thinking of creating a similar project, remember to check the Terms & Conditions of the website from which you are capturing the data, as they may forbid scraping of their content.
Would a similar website be useful in your city, too? Write me and we can make it happen!
Copy the code in this repository into new script files inside your new Apps Script project
In the postUpdater.gs code, substitute the placeholders in square brackets ([DATABASE SERVER], [DATABASE USERNAME], [DATABASE PASSWORD] and [DATABASE NAME]) with your WordPress database credentials
In the same file, substitute [POST ID] in the SQL statement with the ID of the WordPress post you want to update
Save all files in the Apps Script project and close it
Refresh the document page, update its content and click on ‘WordPress > Update post content’ in the menu
Your WordPress post content will be automatically updated with the content of your Google document!
Note: the script automatically appends HTML tags to your document and removes them afterwards, in order to transfer formatting styles, lists and links.
links.gs
function links() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var idc = text.getTextAttributeIndices();
  var bold = [];
  var bold2 = [];
  var bold3 = [];
  var bold4 = [];
  var bold5 = [];
  var bold6 = [];
  // Retrieve the exact indices where links are, and save them in arrays together with the link URLs
  for (var i = 0; i < idc.length; i++) {
    if (text.getLinkUrl(idc[i]) != null) {
      bold.push(idc[i]);
      bold2.push(idc[i + 1]);
      bold3.push(text.getLinkUrl(idc[i]));
      bold4.push(text.getLinkUrl(idc[i]).length);
      var sum = bold4.reduce(function(a, b) { return a + b; }, 0);
      bold5.push(sum);
    }
  }
  // Insert HTML for links
  text.insertText(bold[0], "<a href=''" + bold3[0] + "''>");
  text.insertText(bold2[0] + 13 + bold3[0].length, "</a>");
  for (var i = 1; i < bold.length; i++) {
    if (bold2[i] != undefined) {
      text.insertText(bold[i] + (17 * i) + (bold5[i - 1]), "<a href=''" + bold3[i] + "''>");
      text.insertText(bold2[i] + 17 + bold3[0].length + 13 + bold3[i].length, "</a>");
    } else {
      text.insertText(bold[i] + (17 * i) + (bold5[i - 1]), "<a href=''" + bold3[i] + "''>");
      var eh = body.getText();
      text.insertText(eh.length, "</a>");
    }
  }
}
list.gs
function list() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var list = body.findElement(DocumentApp.ElementType.LIST_ITEM).getElement().editAsText();
  var bold = [];
  // Append HTML list tags to the list elements
  for (var i = 0; i < DocumentApp.getActiveDocument().getNumChildren(); i++) {
    var firstChild = DocumentApp.getActiveDocument().getBody().getChild(i);
    if (firstChild.getType() == DocumentApp.ElementType.LIST_ITEM) {
      firstChild.replaceText(firstChild.getText(), "<li>" + firstChild.getText() + "</li>");
      var childIndex = firstChild.getListId();
      bold.push(i);
    }
  }
  // Append HTML main <ul> or <ol> list tags to the lists
  for (var i = 0; i < bold.length; i++) {
    if (bold[i - 1] != bold[i] - 1) {
      var firstChild = body.getChild(bold[i]);
      if (firstChild.getGlyphType() != DocumentApp.GlyphType.NUMBER) {
        firstChild.replaceText(firstChild.getText(), "<ul>" + firstChild.getText());
      } else {
        firstChild.replaceText(firstChild.getText(), "<ol>" + firstChild.getText());
      }
    }
  }
  for (var i = 0; i < bold.length; i++) {
    if (bold[i + 1] != bold[i] + 1) {
      Logger.log(i);
      var firstChild = body.getChild(bold[i]);
      if (firstChild.getGlyphType() != DocumentApp.GlyphType.NUMBER) {
        firstChild.appendText("</ul>");
      } else {
        firstChild.appendText("</ol>");
      }
    }
  }
}
postFormatter.gs
function postFormatter() {
  // This script adds <strong> and <em> tags to the text, preparing it to be uploaded in the WordPress database
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var idc = text.getTextAttributeIndices();
  var bold = [];
  var bold2 = [];
  var bold3 = [];
  // Identify where the Bold text is (indexes), and push it into arrays
  for (var i = 0; i < idc.length; i++) {
    if (text.isBold(idc[i]) && idc[i + 1] !== undefined) {
      bold.push(idc[i]);
      bold2.push(idc[i + 1]);
    } else if (text.isBold(idc[i]) && idc[i + 1] == undefined) {
      bold.push(idc[i]);
    }
  }
  for (var i = 0; i < bold.length; i++) {
    if (bold2[i] !== undefined) {
      bold3.push(text.getText().slice(bold[i], bold2[i]));
    } else bold3.push(text.getText().slice(bold[i]));
  }
  bold3 = bold3.filter(function(item, index, inputArray) { return inputArray.indexOf(item) == index; });
  // Add HTML tags for Bold
  for (var i = 0; i < bold3.length; i++) {
    body.replaceText(bold3[i], "<strong>" + bold3[i] + "</strong>");
  }
  var idc = text.getTextAttributeIndices();
  var bold = [];
  var bold2 = [];
  var bold3 = [];
  // Identify where the Emphasized text is (indexes), and push it into arrays
  for (var i = 0; i < idc.length; i++) {
    if (text.isItalic(idc[i]) && idc[i + 1] !== undefined) {
      bold.push(idc[i]);
      bold2.push(idc[i + 1]);
    } else if (text.isItalic(idc[i]) && idc[i + 1] == undefined) {
      bold.push(idc[i]);
    }
  }
  for (var i = 0; i < bold.length; i++) {
    if (bold2[i] !== undefined) {
      bold3.push(text.getText().slice(bold[i], bold2[i]));
    } else bold3.push(text.getText().slice(bold[i]));
  }
  bold3 = bold3.filter(function(item, index, inputArray) { return inputArray.indexOf(item) == index; });
  // Add HTML tags for Emphasized text
  for (var i = 0; i < bold3.length; i++) {
    body.replaceText(bold3[i], "<em>" + bold3[i] + "</em>");
  }
}
postUpdater.gs
function postUpdater() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var list2 = body.findElement(DocumentApp.ElementType.LIST_ITEM).getElement().editAsText();
  // Run the functions that prepare the text for HTML
  postFormatter();
  links();
  list();
  // Update the post content in the WordPress database (post ID to be specified in the SQL statement below)
  var address = '[DATABASE SERVER]';
  var user = '[DATABASE USERNAME]';
  var userPwd = '[DATABASE PASSWORD]';
  var db = '[DATABASE NAME]';
  var instanceUrl = 'jdbc:mysql://' + address;
  var dbUrl = instanceUrl + '/' + db;
  var bodt = body.getText();
  var conn = Jdbc.getConnection(dbUrl, user, userPwd);
  var SQLstatement = conn.createStatement();
  var result = SQLstatement.executeUpdate("UPDATE wp_posts SET post_content='" + bodt + "' WHERE ID=[POST ID]");
  SQLstatement.close();
  conn.close();
  // Remove the tags added to the Google document, restoring it as it was before our functions ran
  body.replaceText("<strong>", "");
  body.replaceText("</strong>", "");
  body.replaceText("<em>", "");
  body.replaceText("</em>", "");
  body.replaceText("<ul>", "");
  body.replaceText("</ul>", "");
  body.replaceText("<ol>", "");
  body.replaceText("</ol>", "");
  body.replaceText("<li>", "");
  body.replaceText("</li>", "");
  body.replaceText("<a href=''", "");
  body.replaceText("</a>", "");
  body.replaceText("''>", "");
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var idc = text.getTextAttributeIndices();
  var bold3 = [];
  for (var i = 0; i < idc.length; i++) {
    if (text.getLinkUrl(idc[i]) != null) {
      bold3.push(text.getLinkUrl(idc[i]));
    }
  }
  for (var i = 0; i < bold3.length; i++) {
    body.replaceText(bold3[i], "");
  }
}
menu.gs
function onOpen(e) {
  // Create a menu on the Google document from which to launch the update
  DocumentApp.getUi().createMenu('WordPress').addItem("Update post content", "postUpdater").addToUi();
}
Below is the code to automatically create a PDF from a Google document and send it via email; you can find the instructions and the code below and in the GitHub repository.