During my most recent conversation with ChatGPT, I made an amazing discovery: the AI language model is able to interpret emoticons and generate stories based on them.
I experimented by sending a number of emoticons to ChatGPT, and the program turned them into a hilarious story that kept me on the edge of my seat the whole time. The creativity and inventiveness displayed by ChatGPT astounded me.
It is fascinating to watch ChatGPT turn something as simple as an emoji into something as complex and engaging as a story. If you haven’t tried it yet, I strongly advise you to do so. New creative ways of interacting with the model are discovered every day.
The widespread adoption of AI promises to revolutionize not only the business world, but also our everyday life. As an illustration, consider the rise of conversational agents like GPT (Generative Pre-trained Transformer), which can hold natural-sounding conversations and carry out tasks like customer service and data collection.
There may be major shifts in the job market as a result of the widespread use of chatbots and other forms of AI technology. As AI and automation increasingly replace people, job markets may change and previously unneeded skills may become necessary. On the other hand, AI has the potential to enhance productivity and generate new employment opportunities.
The geopolitical ramifications of AI are significant as well. As AI becomes more commonplace, governments that are able to develop and deploy these technologies more quickly may gain an advantage in terms of economic growth and military capability. The potential for the misuse of AI in military or espionage contexts is a serious concern, and could lead to an AI “arms race” between governments.
Adopting AI may also have far-reaching effects on society. There is concern that industries like customer service, which rely heavily on human connection, may decline with the rising use of automation and machine learning algorithms. Social harmony and interpersonal relationships may suffer as a consequence.
The overall economic and social effects of AI are intricate and varied. In order to avoid unintended repercussions, governments, organizations, and individuals must responsibly create and utilize AI.
I selected this dataset because of the large number of variables to explore and use in my model, and because the data seemed very clean. I also liked the challenge of predicting a multi-categorical target feature. The competition has been around since 2015, so I was able to find a good number of posts in its “Discussion” section to learn from and improve.
The seven levels of the target feature are described below; in the dataset, however, they are expressed as integers in the last column, “Cover_Type”:
Spruce/Fir
Lodgepole Pine
Ponderosa Pine
Cottonwood/Willow
Aspen
Douglas-fir
Krummholz
The “train” set of the competition includes 15,120 observations, while the “test” set contains 565,892 observations. The descriptive features are, in order:
Elevation – Elevation in meters
Aspect – Aspect in degrees azimuth
Slope – Slope in degrees
Horizontal_Distance_To_Hydrology – Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology – Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways – Horz Dist to nearest roadway
Hillshade_9am (0 to 255 index) – Hillshade index at 9am, summer solstice
Hillshade_Noon (0 to 255 index) – Hillshade index at noon, summer solstice
Hillshade_3pm (0 to 255 index) – Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points – Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) – Wilderness area designation
Soil_Type (40 binary columns, 0 = absence or 1 = presence) – Soil Type designation
Cover_Type (7 types, integers 1 to 7) – Forest Cover Type designation
Data exploration
One of the first elements I checked through visualisation was the balance of the dataset. I discovered that the dataset was perfectly balanced: the 7 levels of the target feature have the same number of observations across the “train” dataset.
Plotting the histograms of all continuous features, I saw that none of them followed a normal distribution.
With the help of visualisation, I also discovered interesting relationships between “Aspect”, the three “Hillshade” columns, and the columns describing horizontal distances.
There were no missing values in the dataset; however, I was concerned by some continuous features having a large number of zeros. I wrote a check that identified those features and then read the dataset description on the competition’s Kaggle page. I concluded that those zeros were legitimate and had to be left as they were (for example, in “Slope”, “Aspect”, and the distances from hydrology or roadways): rather than errors or missing values, they were correct values.
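A minimal sketch of this kind of check, assuming the training data is loaded into the pandas DataFrame “trees” used in the full code at the end of the article:

# Count the zeros in each of the 10 continuous columns and show only the affected features
zero_counts = (trees.iloc[:, :10] == 0).sum()
print(zero_counts[zero_counts > 0])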
As I did not encounter normal distributions, I decided not to investigate outliers further, as their presence could be beneficial to my model. The histograms of the continuous features also did not show any concerning outliers.
The only two categorical features of the dataset, “Soil_Type” and “Wilderness_Area”, were already one-hot encoded by the dataset provider into binary columns. Although this type of encoding is well suited to the KNN algorithm, it aggravates the issue of dimensionality, especially for the “Soil_Type” category (consisting of 40 levels). For this reason, I decided to reverse this encoding by creating two new columns, “Soil_Type_All” and “Wilderness_Area_All”; I would test the two types of encoding in my iterations to assess what works best.
# I reverse the one-hot encoding of Soil_Type and Wilderness_Area, creating two new columns
soil_start = trees.columns.get_loc("Soil_Type1")
soil_end = trees.columns.get_loc("Soil_Type40")
trees.insert(soil_end+1, 'Soil_Type_All', trees.iloc[:, soil_start:soil_end+1].idxmax(axis=1))
area_start = trees.columns.get_loc("Wilderness_Area1")
area_end = trees.columns.get_loc("Wilderness_Area4")
trees.insert(area_end+1, 'Wilderness_Area_All', trees.iloc[:, area_start:area_end+1].idxmax(axis=1))
The model: K-nearest neighbour
Because of the high number of continuous features in the dataset, I decided to go with similarity-based learning, in particular the K-nearest neighbours algorithm.
As a first step, I normalised the continuous features, as KNN is very sensitive to unnormalised data.
I transformed the values in the newly created categorical columns into integers, so that they could be used by my model. I was conscious that this is not ideal, as it implies an order between the levels of the categories; however, I still thought it was better than training the model on too many features.
Before running the first iterations, for local evaluation purposes, I split the “train.csv” dataset into train (80%) and test (20%) sets. In order to maintain the target feature balance, I used the “stratify” parameter of the “train_test_split” function.
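Condensed into a short sketch (the full version appears in the code at the end of the article), these two steps look roughly like this:

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Normalise the 10 continuous features, then create a stratified 80/20 split
min_max_scaler = preprocessing.MinMaxScaler()
trees.iloc[:, :10] = min_max_scaler.fit_transform(trees.iloc[:, :10])
X_train, X_test, y_train, y_test = train_test_split(
    trees.iloc[:, :10], trees['Cover_Type'],
    stratify=trees['Cover_Type'], test_size=0.20, random_state=1)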
Model iterations
Since the dataset is perfectly balanced, I knew I could try high “k” values. My first iteration (“M1”) involved trying different values for “k”. I iterated through 3, 5, 6 and 8, and discovered that the best value of “k” was 5, which gave me an accuracy of 76%.
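The full code at the end repeats one classifier per value of “k”; condensed into a loop (and assuming the scikit-learn imports from that code), the idea is simply:

# Compare a few values of k on the local hold-out set
for k in [3, 5, 6, 8]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test)))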
My second iteration involved trying a distance metric other than Euclidean: Manhattan, which is less influenced by single large differences in individual features. The model accuracy was 74%, lower than before. For this reason, I decided to use the Euclidean distance (the default) for my following iterations.
For my third iteration, I decided to reduce the number of features on which to train the model, with the help of the scikit-learn function “SelectKBest”. The function selected the top 8 features (out of 10) for me; the accuracy of the third model (“M3”) was still just above 76%.
# I use the SelectKBest function to select the 8 most useful features for my model
selector = SelectKBest(chi2, k=8)
selector.fit(X_train, y_train)
cols = selector.get_support(indices=True)
X_train_new = X_train.iloc[:, cols]
sel_col = []
for col in X_train_new:
    sel_col.append(X_test.columns.get_loc(X_train_new[col].name))

# KNN classifier built with the previously selected features only
mod3 = KNeighborsClassifier(n_neighbors=5)
mod3.fit(X_train_new, y_train)
y_pred = mod3.predict(X_test.iloc[:, sel_col])
print("M3 Accuracy score: " + str(accuracy_score(y_test, y_pred)))
Before proceeding with the 4th iteration, I tested the difference between feeding the model the one-hot encoded “Wilderness_Area” features (4 binary columns) and using the created “Wilderness_Area_All” column, with integer encoding. The accuracy of the two KNN classifiers was exactly the same.
For my following iteration, I included the “Wilderness_Area_All” and “Soil_Type_All” features in the model training. This greatly improved my model (“M4”), which scored an accuracy of 80%.
My 5th iteration involved changing the number of top features selected from 8 to 6. This proved to be successful, with an accuracy score for “M5” of 83%.
In my 6th iteration, I ran the same model but with a distance-weighted KNN algorithm, by adding the “weights” parameter. The reasoning was that closer neighbours should carry more weight in the prediction. “M6” was the most successful model found, scoring an accuracy of 85%.
# I test a distance weighted KNN
mod6 = KNeighborsClassifier(n_neighbors=5,weights='distance')
mod6.fit(X_train_new, y_train)
y_pred = mod6.predict(X_test.iloc[:, sel_col])
print("M6 Accuracy score: " + str(accuracy_score(y_test, y_pred)))
After checking single-class accuracy, I noticed that for three classes I achieved an accuracy of over 95%, while for one my accuracy was only 67%. This is an interesting point for future investigations.
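The per-class figures come from the diagonal of the row-normalised confusion matrix, roughly as follows (using the imports from the full code at the end):

# Per-class accuracy: normalise each row of the confusion matrix and read its diagonal
cm = confusion_matrix(y_test, y_pred)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print(cm.diagonal())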
For my last iteration, I engineered a new feature, “Hillshade_Avg”, created by averaging the three features “Hillshade_9am”, “Hillshade_Noon” and “Hillshade_3pm”. I then trained my model with it instead of the original three features, but the accuracy of “M7” was exactly the same as “M6”.
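In pandas this is a one-liner; a sketch equivalent to the insert-based version in the full code:

# Average the three hillshade readings into a single engineered feature
trees['Hillshade_Avg'] = trees[['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']].mean(axis=1)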
The best performing KNN model found, at the end of all iterations, was “M6”.
Kaggle performance report
I ran my model on the Kaggle competition “test” dataset, and submitted my predictions. My score was 0.68217. This result was slightly disappointing, given the high accuracy achieved in my local evaluation of the model.
Future improvements
With more time at my disposal, I would improve my model in the following ways:
Using 10-fold cross-validation in my local evaluation strategy (see the sketch after this list).
Deeper data exploration, by:
better investigating feature relationships and correlations;
investigating outliers, even though the distributions of the continuous features are not normal.
Feature engineering: creating new features that describe the relationships between the original features, and using them to train my model.
Investigating the classes with low prediction accuracy, by looking at the model accuracy class by class.
Testing a decision tree algorithm, after binning the continuous descriptive features.
Grouping the 40 different “Soil_Types” into fewer categories.
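As an illustration of the first point, a 10-fold cross-validation of the best model could be sketched with scikit-learn’s cross_val_score (a sketch only, not something I ran for this article):

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation of the distance-weighted KNN on the selected features
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
scores = cross_val_score(knn, X_train_new, y_train, cv=10)
print(scores.mean(), scores.std())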
Full code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import confusion_matrix
# I import the training data, and check its main characteristics
trees = pd.read_csv("train.csv")
print(trees.shape)
trees = trees.set_index('Id')
print(trees.shape)
trees.dtypes
print(trees.describe())

# Check for missing values
trees.isnull().any()

# Check the minimum and maximum of each feature
for column in trees:
    print(column + ': ' + str(trees[column].min()) + ' - ' + str(trees[column].max()))

# I reverse the one-hot encoding of Soil_Type and Wilderness_Area, creating two new columns
soil_start = trees.columns.get_loc("Soil_Type1")
soil_end = trees.columns.get_loc("Soil_Type40")
trees.insert(soil_end+1, 'Soil_Type_All', trees.iloc[:, soil_start:soil_end+1].idxmax(axis=1))
area_start = trees.columns.get_loc("Wilderness_Area1")
area_end = trees.columns.get_loc("Wilderness_Area4")
trees.insert(area_end+1, 'Wilderness_Area_All', trees.iloc[:, area_start:area_end+1].idxmax(axis=1))

# Visualise the target feature and the categorical feature Wilderness_Area
fig, ax = plt.subplots(1,2, figsize=(17,5))
ax[0].bar(trees['Cover_Type'].unique(), trees['Cover_Type'].value_counts())
ax[1].bar(trees['Wilderness_Area_All'].unique(), trees['Wilderness_Area_All'].value_counts())
plt.show()

# Visualise all continuous descriptive features
fig, ax = plt.subplots(2,4,figsize=(17,6))
ax[0,0].hist(trees.iloc[:,0])
ax[0,0].set_title(trees.columns[0])
ax[0,1].hist(trees.iloc[:,1])
ax[0,1].set_title(trees.columns[1])
ax[0,2].hist(trees.iloc[:,2])
ax[0,2].set_title(trees.columns[2])
ax[0,3].hist(trees.iloc[:,3])
ax[0,3].set_title(trees.columns[3])
ax[1,0].hist(trees.iloc[:,4])
ax[1,0].set_title(trees.columns[4])
ax[1,1].hist(trees.iloc[:,5])
ax[1,1].set_title(trees.columns[5])
ax[1,2].hist(trees.iloc[:,9])
ax[1,2].set_title(trees.columns[9])
ax[1,3].axis('off')
plt.tight_layout(h_pad=3)
plt.show()

# Visualise in detail Elevation and Aspect
fig, ax = plt.subplots(1,2,figsize=(20,8))
ax[0].hist(trees.iloc[:,0], bins=30)
ax[0].set_title(trees.columns[0])
ax[1].hist(trees.iloc[:,1], bins=30)
ax[1].set_title(trees.columns[1])
plt.show()

# Huge pairwise relationships plot of all continuous values; it may take time to load
sns.set()
# sns_plot = sns.pairplot(trees.iloc[:,0:10])
# sns_plot.savefig("pairplot.png")

# Detail of interesting relationships, with the target feature plotted as colour
sns.pairplot(trees.iloc[:,[0,1,56]], hue='Cover_Type');
sns.pairplot(trees.iloc[:,[6,7,8,56]], hue='Cover_Type');

# Histograms of the main continuous features with the Cover_Type dimension on colour
trees.pivot(columns="Cover_Type", values="Aspect").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Slope").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Elevation").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Hillshade_Noon").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Horizontal_Distance_To_Hydrology").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Vertical_Distance_To_Hydrology").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Horizontal_Distance_To_Roadways").plot.hist(bins=50)
plt.show()
trees.pivot(columns="Cover_Type", values="Horizontal_Distance_To_Fire_Points").plot.hist(bins=50)
plt.show()

# Boxplots of the main continuous features by Cover_Type
plt.clf()
sns.boxplot(x="Cover_Type", y="Slope", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Elevation", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Aspect", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Hillshade_Noon", data=trees)
plt.show()
plt.clf()
sns.boxplot(x="Cover_Type", y="Horizontal_Distance_To_Hydrology", data=trees)
plt.show()
plt.clf()

# Normalisation of continuous features
min_max_scaler = preprocessing.MinMaxScaler()
trees.iloc[:,:10] = min_max_scaler.fit_transform(trees.iloc[:,:10])

# I transform the categorical values into integers
trees['Wilderness_Area_All'] = trees['Wilderness_Area_All'].str.replace('Wilderness_Area', '').astype(int)
trees['Soil_Type_All'] = trees['Soil_Type_All'].str.replace('Soil_Type', '').astype(int)

# I select the first 10 features, and split the dataset into training and test sets, for local evaluation purposes
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,:10], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)

# I test various k values
# First KNN classifier with K = 3, Euclidean distance (default)
mod1 = KNeighborsClassifier(n_neighbors=3)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K3 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# KNN classifier with K = 5
mod1 = KNeighborsClassifier(n_neighbors=5)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K5 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# First KNN classifier with K = 6
mod1 = KNeighborsClassifier(n_neighbors=6)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K6 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# First KNN classifier with K = 8
mod1 = KNeighborsClassifier(n_neighbors=8)
mod1.fit(X_train, y_train)
y_pred = mod1.predict(X_test)print("M1 K8 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# I test a different type of distance metric# KNN classifier with K = 5, manhattan distance
mod2 = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
mod2.fit(X_train, y_train)
y_pred = mod2.predict(X_test)print("M2 Accuracy score: "+str(accuracy_score(y_test, y_pred)))# I use the SelectKBest function to select the most 8 useful features for my model
selector = SelectKBest(chi2, k=8)
selector.fit(X_train, y_train)
cols = selector.get_support(indices=True)
X_train_new = X_train.iloc[:,cols]
sel_col = []
for col in X_train_new:
    sel_col.append(X_test.columns.get_loc(X_train_new[col].name))

# KNN classifier built with the previously selected features only
mod3 = KNeighborsClassifier(n_neighbors=5)
mod3.fit(X_train_new, y_train)
y_pred = mod3.predict(X_test.iloc[:, sel_col])
print("M3 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# Test to assess the impact of the two different encodings of the categorical values
# KNN with "Wilderness_Area" encoded as integers in a single column
print(trees.columns.get_loc("Wilderness_Area_All"))
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,14]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
modt = KNeighborsClassifier(n_neighbors=5)
modt.fit(X_train, y_train)
y_pred = modt.predict(X_test)
print("Test with 'integer' encoding accuracy score: " + str(accuracy_score(y_test, y_pred)))

# KNN with "Wilderness_Area" one-hot encoded, as it was originally in the dataset
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,10,11,12,13]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
modt2 = KNeighborsClassifier(n_neighbors=5)
modt2.fit(X_train, y_train)
y_pred = modt2.predict(X_test)
print("Test with 'one-hot' encoding accuracy score: " + str(accuracy_score(y_test, y_pred)))

# For the next classifier, I include the "Wilderness_Area_All" and "Soil_Type_All" features
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,5,6,7,8,9,14,55]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
mod4 = KNeighborsClassifier(n_neighbors=5)
mod4.fit(X_train, y_train)
y_pred = mod4.predict(X_test)
print("M4 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# I repeat the process of automatic feature selection, but this time I select 6 instead of 8 top features
selector = SelectKBest(chi2, k=6)
selector.fit(X_train, y_train)
cols = selector.get_support(indices=True)
X_train_new = X_train.iloc[:,cols]
sel_col = []
for col in X_train_new:
    sel_col.append(X_test.columns.get_loc(X_train_new[col].name))
mod5 = KNeighborsClassifier(n_neighbors=5)
mod5.fit(X_train_new, y_train)
y_pred = mod5.predict(X_test.iloc[:, sel_col])
print("M5 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# I test a distance-weighted KNN
mod6 = KNeighborsClassifier(n_neighbors=5,weights='distance')
mod6.fit(X_train_new, y_train)
y_pred = mod6.predict(X_test.iloc[:, sel_col])
print("M6 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# Check single-class accuracy
cm = confusion_matrix(y_test, y_pred)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print(cm.diagonal())

# I create a new "Hillshade_Avg" column, by averaging "Hillshade_9am", "Hillshade_Noon" and "Hillshade_3pm"
shade_start = trees.columns.get_loc("Hillshade_9am")
shade_end = trees.columns.get_loc("Hillshade_3pm")
trees.insert(shade_end+1, 'Hillshade_Avg', (trees.iloc[:, shade_start] + trees.iloc[:, shade_start+1] + trees.iloc[:, shade_end]) / 3)

# KNN classifier that includes the newly created feature
X_train, X_test, y_train, y_test = train_test_split(trees.iloc[:,[0,1,2,3,4,5,9,10,15,56]], trees.iloc[:,-1], stratify=trees.iloc[:,-1], test_size=0.20, random_state=1)
selector2 = SelectKBest(chi2, k=6)
selector2.fit(X_train, y_train)
cols2 = selector2.get_support(indices=True)
X_train_new2 = X_train.iloc[:,cols2]
sel_col2 = []
for col in X_train_new2:
    sel_col2.append(X_test.columns.get_loc(X_train_new2[col].name))
mod7 = KNeighborsClassifier(n_neighbors=5,weights='distance')
mod7.fit(X_train_new2, y_train)
y_pred = mod7.predict(X_test.iloc[:, sel_col2])
print("M7 Accuracy score: " + str(accuracy_score(y_test, y_pred)))

# Preparation of the Kaggle test dataset: I apply the same transformations made to the train dataset
kaggle_test = pd.read_csv("test.csv")
kaggle_test = kaggle_test.set_index('Id')
soil_start = kaggle_test.columns.get_loc("Soil_Type1")
soil_end = kaggle_test.columns.get_loc("Soil_Type40")
kaggle_test.insert(soil_end+1, 'Soil_Type_All', kaggle_test.iloc[:, soil_start:soil_end+1].idxmax(axis=1))
area_start = kaggle_test.columns.get_loc("Wilderness_Area1")
area_end = kaggle_test.columns.get_loc("Wilderness_Area4")
kaggle_test.insert(area_end+1, 'Wilderness_Area_All', kaggle_test.iloc[:, area_start:area_end+1].idxmax(axis=1))
# Apply the scaler fitted on the training data (transform only, no re-fitting)
kaggle_test.iloc[:,:10] = min_max_scaler.transform(kaggle_test.iloc[:,:10])
kaggle_test['Wilderness_Area_All']= kaggle_test['Wilderness_Area_All'].str.replace('Wilderness_Area','').astype(int)
kaggle_test['Soil_Type_All'] = kaggle_test['Soil_Type_All'].str.replace('Soil_Type', '').astype(int)

# Dataset split and target feature prediction with the best model: "M6"
kaggle_test_sel = kaggle_test.iloc[:,[0,1,2,3,4,5,6,7,8,9,14,55]]
kaggle_pred = mod6.predict(kaggle_test_sel.iloc[:, sel_col])

# Export to CSV for submission
submission = {'Id': kaggle_test.index, 'Cover_Type': kaggle_pred}
subm = pd.DataFrame(submission, columns=['Id', 'Cover_Type'])
subm.to_csv(r'xxx.csv', index=False)
Can you imagine if you could sponsor the published content of any company page on LinkedIn? Well… you can.
I discovered this by chance, while exploring the security features of the LinkedIn Ads platform. As a first warning sign, I found that I could create an ads account connected to any company page on the platform. For my test, I created a new ads account connected to the Google LinkedIn company page.
Not a big deal, I thought, as connecting the page would not necessarily mean that I would be automatically authorized to advertise on its behalf. I proceeded by filling out the details of the campaigns. When I reached the ad selection page, I couldn’t believe my eyes.
The option “Create new ad” was greyed out, however I could select the ad creative from a list of already published posts by the Google LinkedIn company page. I selected this post from the list.
I launched the campaign, and it worked! I was now advertising on behalf of Google, sponsoring one of their posts. The campaign started accruing some impressions and clicks, at which point I stopped it.
In order to test that this was not an isolated case, I replicated the test by creating a new account connected, this time, to the Prada Group company page. For this experiment I selected a “seasoned” post (2 years of age) from their published content list. And yet again, the campaign started smoothly.
It can be argued that there is no real issue here, as it would be impossible to damage a company by sponsoring its own content. While this is partly true, it is also the case that seeing very old content could confuse users. Imagine, in particular, a situation where a company changes its standpoint on a certain matter over time, and a malicious actor sponsors one of its old posts, in which the company stated the opposite of its most recent viewpoint. I can’t see how this wouldn’t cause potential havoc.
I immediately notified LinkedIn of this and I am currently waiting for an answer.
Have any of you ever noticed the same?
Do you think this happens by design, or did I just find a bug?
Could you think of other possible malicious exploitation scenarios?
Google Trends is a great tool for checking relative search-volume trends for keywords or topics, and for comparing them with each other. The tool allows you to download the main trend graph points as a CSV file.
Some time ago I was practicing calculating linear regressions in Excel on 2-year sets of Google Trends data. I chose a time range of 2 years to account for possible seasonality. I wondered if I could automate this process with a script and output a single metric (the slope value) as an indication of the search trend over the past 2 years. A negative number means that searches are decreasing, while a positive number means that the trend is growing.
After a few searches I became aware of Scipy, a Python open-source library for scientific and technical computing that includes mathematical functions, including regressions.
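As a toy illustration of the core calculation, with made-up values, the slope of a linear regression over weekly data points is a one-liner with Scipy:

from scipy.stats import linregress

weeks = list(range(1, 11))                          # e.g. 10 weeks of data
values = [40, 42, 45, 43, 48, 50, 49, 53, 55, 58]   # made-up relative search volumes
print(linregress(weeks, values).slope)              # a positive slope indicates an upward trend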
I started this project by analyzing how Google Trends constructs its web requests in the browser, using the Google Chrome “Network” tool, and then I transferred what I learned into a Scrapy script (Scrapy being my favorite scraping library).
The result is the script shared with you at the end of this article; it can be called server-side with the following command:
scrapy runspider -L WARNING filename.py
In order for the script to work, both the Scrapy and Scipy libraries must be installed on your server and imported at the beginning of the script, along with the “datetime” and “json” modules. The code is ready to be used; just make sure to substitute the “keyword” value with the term you want to query trends for. The default data-point resolution is weekly, and the slope is calculated by pairing the numbers from 1 to 104 (the number of weeks) with the value extracted for each week.
I hope you will make good use of the script for your purposes, and please don’t hesitate to ask me if you have any doubts about its functions.
import scrapy
import json
from datetime import date, timedelta
from scipy.stats import linregress
# Calculate the current date and the same day 2 years earlier, in the format YYYY-MM-DD
mo = date.today().strftime('%m')
da = date.today().strftime('%d')
ye = date.today().year
old = str(ye - 2) + '-' + mo + '-' + da
today = str(ye) + '-' + mo + '-' + da
# We will query 2 years of data, therefore 104 weeks
# We will query 2 years of data, therefore 104 weeks
weeks = list(range(1, 105))

# Customize this with the keyword you want to query trends for
keyword = 'bitcoin'

# URL constructors
url1 = 'https://trends.google.com/trends/api/explore?hl=en-US&tz=-60&req={"comparisonItem":[{"keyword":"'
url2 = '","geo":"IE","time":"' + old + ' ' + today + '"}],"category":0,"property":""}&tz=-60'
url3 = ('https://trends.google.com/trends/api/widgetdata/multiline?hl=en-US&tz=-60&req={"time":"' + old + ' ' + today + '","resolution":"WEEK","locale":"en-US","comparisonItem":[{"geo":{"country":"IE"},"complexKeywordsRestriction":' + '{"keyword":[{"type":"BROAD","value":"')
url4 = '"}]}}],"requestOptions":{"property":"","backend":"IZG","category":0}}&token='
url5 = '&tz=-60'


# Scrapy scraper
class TrendsSpider(scrapy.Spider):
    name = "trends2"
    handle_httpstatus_list = [429, 500]

    def start_requests(self):
        # First we crawl the Google Trends root domain, in order to receive a valid session cookie
        url = 'https://trends.google.com/'
        yield scrapy.Request(url, callback=self.parse, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'})

    def parse(self, response):
        # The request that will return a JSON file with the tokens for each widget
        newurl = url1 + keyword + url2
        return scrapy.Request(newurl, callback=self.parse_1)

    def parse_1(self, response):
        # We have to delete the first characters of the response, as they are not valid JSON
        jsonresponse = json.loads(response.text[5:])
        # We parse the JSON response, capture the token for the "Interest over time" widget and construct the new URL with it
        kwd = jsonresponse['widgets'][0]['request']['comparisonItem'][0]['complexKeywordsRestriction']['keyword'][0]['value']
        tok = jsonresponse['widgets'][0]['token']
        finalurl = url3 + kwd + url4 + tok + url5
        # We pass the keyword to the next parser function
        return scrapy.Request(finalurl, callback=self.parse_2, meta={'keyword': kwd})

    def parse_2(self, response):
        jsonresponse2 = json.loads(response.text[6:])
        # We insert the weekly values in an array
        values = []
        for i in jsonresponse2['default']['timelineData']:
            values.extend(i['value'])
        if len(values) == 0:
            slope = "0.00"
        else:
            # linregress returns the slope of the linear regression fitted on the week numbers (1-104) paired with the values
            slope = "%.2f" % linregress(weeks, values).slope
        # Print the keyword with the slope value
        print(response.meta['keyword'] + ': ' + slope)
These days, if you want to check what’s on at the cinema in your city, your only option is to sort through the online listings by movie or by cinema. My girlfriend told me that newspapers, years ago, used to list movies by hour, and that this sorting method was extremely handy to consult on the same day.
Since at the time I was practicing web scraping with Scrapy, I decided to give it a go, fill this gap in the online cinema listings space and create a website that she and other people could use, with the look & feel of an old newspaper.
I created Finema by writing a very simple page of code that combines PHP, HTML and JavaScript. The PHP code queries and fetches in real time a JSON file on my personal server containing the cinema listings data. That JSON file is generated by a Scrapy script I developed, which captures the daily movie timings from a popular Irish online entertainment website every few hours.
All movies of the selected cinemas are listed by hour, starting from the first movie of the day and finishing with the last one. By clicking on a cinema name, it is possible to visualize the daily shows in that particular cinema only, while the drop-down allows you to select a specific movie. The two filters can be combined.
The risk with this type of aggregator lies in the source of the scraped data: if the website from which the data is taken changes its HTML code, the scraping script can break and return corrupted data, or no data at all. This is why it is important to put in place regular automated checks and notifications, so that you can amend your scripts as soon as something breaks.
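What such a check looks like depends on your setup; as a rough sketch (the file name and field names below are only placeholders, not part of my actual script), you could validate the freshly scraped JSON before publishing it:

import json

# Refuse to publish a scraped listings file that looks broken
with open('listings.json') as f:          # hypothetical output file of the scraping script
    listings = json.load(f)

if not listings or any('title' not in show or 'time' not in show for show in listings):
    raise SystemExit('Scraper output looks corrupted: keep the previous file and send an alert')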
Also, if you are thinking of creating a similar project, remember to check the Terms & Conditions of the website from which you are capturing the data, as they may forbid scraping of their content.
Would a similar website be useful in your city, too? Write me and we can make it happen!
Copy the code in this repository into new script files inside your new Apps Script project
In the postUpdater.gs code, substitute the placeholders in square brackets ([DATABASE SERVER], [DATABASE USERNAME], [DATABASE PASSWORD] and [DATABASE NAME]) with your WordPress database credentials
In the same file, substitute [POST ID] in the SQL statement with the ID of the WordPress post you want to update
Save all files in the Apps Script project and close it
Refresh the document page, update its content and click on ‘WordPress > Update post content’ in the menu
Your WordPress post content will be automatically updated with the content of your Google document!
Note: the script automatically appends HTML tags to your document and removes them afterwards, in order to transfer formatting styles, lists and links.
links.gs
function links() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var idc = text.getTextAttributeIndices();
  var bold = [];
  var bold2 = [];
  var bold3 = [];
  var bold4 = [];
  var bold5 = [];
  var bold6 = [];
  // Retrieve the exact indices where links are, and save them in arrays together with the link URLs
  for (var i = 0; i < idc.length; i++) {
    if (text.getLinkUrl(idc[i]) != null) {
      bold.push(idc[i]);
      bold2.push(idc[i + 1]);
      bold3.push(text.getLinkUrl(idc[i]));
      bold4.push(text.getLinkUrl(idc[i]).length);
      var sum = bold4.reduce(function(a, b) { return a + b; }, 0);
      bold5.push(sum);
    }
  }
  // Insert HTML for links
  text.insertText(bold[0], "<a href=''" + bold3[0] + "''>");
  text.insertText(bold2[0] + 13 + bold3[0].length, "</a>");
  for (var i = 1; i < bold.length; i++) {
    if (bold2[i] != undefined) {
      text.insertText(bold[i] + (17 * i) + (bold5[i - 1]), "<a href=''" + bold3[i] + "''>");
      text.insertText(bold2[i] + 17 + bold3[0].length + 13 + bold3[i].length, "</a>");
    } else {
      text.insertText(bold[i] + (17 * i) + (bold5[i - 1]), "<a href=''" + bold3[i] + "''>");
      var eh = body.getText();
      text.insertText(eh.length, "</a>");
    }
  }
}
list.gs
function list() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var list = body.findElement(DocumentApp.ElementType.LIST_ITEM).getElement().editAsText();
  var bold = [];
  // Append HTML list tags to the list elements
  for (var i = 0; i < DocumentApp.getActiveDocument().getNumChildren(); i++) {
    var firstChild = DocumentApp.getActiveDocument().getBody().getChild(i);
    if (firstChild.getType() == DocumentApp.ElementType.LIST_ITEM) {
      firstChild.replaceText(firstChild.getText(), "<li>" + firstChild.getText() + "</li>");
      var childIndex = firstChild.getListId();
      bold.push(i);
    }
  }
  // Append HTML main <ul> or <ol> list tags to the lists
  for (var i = 0; i < bold.length; i++) {
    if (bold[i - 1] != bold[i] - 1) {
      var firstChild = body.getChild(bold[i]);
      if (firstChild.getGlyphType() != DocumentApp.GlyphType.NUMBER) {
        firstChild.replaceText(firstChild.getText(), "<ul>" + firstChild.getText());
      } else {
        firstChild.replaceText(firstChild.getText(), "<ol>" + firstChild.getText());
      }
    }
  }
  for (var i = 0; i < bold.length; i++) {
    if (bold[i + 1] != bold[i] + 1) {
      Logger.log(i);
      var firstChild = body.getChild(bold[i]);
      if (firstChild.getGlyphType() != DocumentApp.GlyphType.NUMBER) {
        firstChild.appendText("</ul>");
      } else {
        firstChild.appendText("</ol>");
      }
    }
  }
}
postFormatter.gs
function postFormatter() {
  // This script adds <strong> and <em> tags to the text, preparing it to be uploaded in the WordPress database
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var idc = text.getTextAttributeIndices();
  var bold = [];
  var bold2 = [];
  var bold3 = [];
  // Identify where the Bold text is (indexes), and push it into arrays
  for (var i = 0; i < idc.length; i++) {
    if (text.isBold(idc[i]) && idc[i + 1] !== undefined) {
      bold.push(idc[i]);
      bold2.push(idc[i + 1]);
    } else if (text.isBold(idc[i]) && idc[i + 1] == undefined) {
      bold.push(idc[i]);
    }
  }
  for (var i = 0; i < bold.length; i++) {
    if (bold2[i] !== undefined) {
      bold3.push(text.getText().slice(bold[i], bold2[i]));
    } else bold3.push(text.getText().slice(bold[i]));
  }
  bold3 = bold3.filter(function(item, index, inputArray) { return inputArray.indexOf(item) == index; });
  // Add HTML tags for Bold
  for (var i = 0; i < bold3.length; i++) {
    body.replaceText(bold3[i], "<strong>" + bold3[i] + "</strong>");
  }
  var idc = text.getTextAttributeIndices();
  var bold = [];
  var bold2 = [];
  var bold3 = [];
  // Identify where the Emphasized text is (indexes), and push it into arrays
  for (var i = 0; i < idc.length; i++) {
    if (text.isItalic(idc[i]) && idc[i + 1] !== undefined) {
      bold.push(idc[i]);
      bold2.push(idc[i + 1]);
    } else if (text.isItalic(idc[i]) && idc[i + 1] == undefined) {
      bold.push(idc[i]);
    }
  }
  for (var i = 0; i < bold.length; i++) {
    if (bold2[i] !== undefined) {
      bold3.push(text.getText().slice(bold[i], bold2[i]));
    } else bold3.push(text.getText().slice(bold[i]));
  }
  bold3 = bold3.filter(function(item, index, inputArray) { return inputArray.indexOf(item) == index; });
  // Add HTML tags for Emphasized text
  for (var i = 0; i < bold3.length; i++) {
    body.replaceText(bold3[i], "<em>" + bold3[i] + "</em>");
  }
}
postUpdater.gs
function postUpdater() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var list2 = body.findElement(DocumentApp.ElementType.LIST_ITEM).getElement().editAsText();
  // Run the functions that prepare the text for HTML
  postFormatter();
  links();
  list();
  // Update the post content in the WordPress database (post ID to be specified in the SQL statement below)
  var address = '[DATABASE SERVER]';
  var user = '[DATABASE USERNAME]';
  var userPwd = '[DATABASE PASSWORD]';
  var db = '[DATABASE NAME]';
  var instanceUrl = 'jdbc:mysql://' + address;
  var dbUrl = instanceUrl + '/' + db;
  var bodt = body.getText();
  var conn = Jdbc.getConnection(dbUrl, user, userPwd);
  var SQLstatement = conn.createStatement();
  var result = SQLstatement.executeUpdate("UPDATE wp_posts SET post_content='" + bodt + "' WHERE ID=[POST ID]");
  SQLstatement.close();
  conn.close();
  // Remove the tags added to the Google document, restoring it as it was before our functions ran
  body.replaceText("<strong>", "");
  body.replaceText("</strong>", "");
  body.replaceText("<em>", "");
  body.replaceText("</em>", "");
  body.replaceText("<ul>", "");
  body.replaceText("</ul>", "");
  body.replaceText("<ol>", "");
  body.replaceText("</ol>", "");
  body.replaceText("<li>", "");
  body.replaceText("</li>", "");
  body.replaceText("<a href=''", "");
  body.replaceText("</a>", "");
  body.replaceText("''>", "");
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var idc = text.getTextAttributeIndices();
  var bold3 = [];
  for (var i = 0; i < idc.length; i++) {
    if (text.getLinkUrl(idc[i]) != null) {
      bold3.push(text.getLinkUrl(idc[i]));
    }
  }
  for (var i = 0; i < bold3.length; i++) {
    body.replaceText(bold3[i], "");
  }
}
menu.gs
function onOpen(e) {
  // Create a menu on the Google document from which to launch the update
  DocumentApp.getUi().createMenu('WordPress').addItem("Update post content", "postUpdater").addToUi();
}
Below is the code to automatically create a PDF from a Google document and send it via email; you can find the instructions and the code below and in the GitHub repository.