Tagging customer queries via Natural Language Processing (NLP) for incredibly fast query resolution

  • Modeling: Data Exploration, Model Architecture (Word Vectorization, CNNs, Hybrid Neural Networks), Rules of Thumb
  • Deployment: Simple Salesforce API, Salesforce Workbench, Data query and updating through Salesforce API


Like any modeling exercise, we improved the model through several iterations of hyperparameter tuning. Along the way, these were the main data challenges we had to solve:

  • High number of labels: There are 40+ subcategories to which a case can be tagged. Additionally, these labels can have overlapping meanings, which can lead to ambiguity in training. We resolved this by clubbing labels with similar meanings. Eliminating ambiguity in the modeling input was crucial to higher accuracy.
  • HTML tags, revert snippets in email texts: Extensive data cleaning was done to remove special characters and unwanted text snippets.
  • Incorrect spellings, mixed language: This prompted us to adopt an approach that uses context modeling instead of keyword-based modeling. We will cover this in more detail later.
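To illustrate the cleaning step, here is a minimal sketch; the specific regexes and the `clean_text` helper are assumptions for illustration, not the production pipeline:

```python
import re

def clean_text(raw):
    """Strip HTML tags, special characters, and extra whitespace."""
    text = re.sub(r'<[^>]+>', ' ', raw)          # remove HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)  # remove special characters
    text = re.sub(r'\s+', ' ', text).strip()     # collapse whitespace
    return text.lower()

print(clean_text("<p>Hi team,</p> Tournament entry failed!!"))
# hi team tournament entry failed
```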
[Figure: words most similar to ‘tournament’ and to ‘rummy’ in the trained embedding space]
from gensim.models.word2vec import Word2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn import utils

# Create unique tags for each document
def labelize(text, label):
    result = []
    prefix = label
    for i, t in zip(text.index, text):
        result.append(TaggedDocument(t.split(),
                                     [prefix + '_%s' % i]))
    return result

def CBOW_model(textseries, epochs=30, size=100):
    all_x = textseries
    all_x_w2v = labelize(all_x, 'all')
    # Define model parameters (sg=0 selects CBOW)
    model_cbow = Word2Vec(sg=0, size=size,
                          negative=5, window=2)
    model_cbow.build_vocab([x.words for x in all_x_w2v])
    for epoch in range(epochs):
        model_cbow.train(utils.shuffle([x.words for x in all_x_w2v]),
                         total_examples=len(all_x_w2v), epochs=1)
    return model_cbow

model_cbow = CBOW_model(df['text'], epochs=30, size=100)
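Under the hood, "most similar" queries like the ones pictured above rank vocabulary words by cosine similarity between their vectors. A dependency-light sketch of that ranking, using made-up toy vectors rather than the trained model:

```python
import numpy as np

def most_similar(query, vocab_vectors, topn=2):
    """Rank words by cosine similarity to the query word's vector."""
    q = vocab_vectors[query]
    scores = {}
    for word, vec in vocab_vectors.items():
        if word == query:
            continue
        scores[word] = float(np.dot(q, vec) /
                             (np.linalg.norm(q) * np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy 3-d vectors (illustrative only)
vecs = {
    'tournament': np.array([0.9, 0.1, 0.0]),
    'contest':    np.array([0.8, 0.2, 0.1]),
    'deposit':    np.array([0.1, 0.9, 0.3]),
}
print(most_similar('tournament', vecs))  # ['contest', 'deposit']
```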
# NUM_WORDS: number of words retained in corpus
def create_embedding_matrix(model_cbow, text_column, NUM_WORDS):
    embeddings_index = {}

    # Create dictionary: word-to-vector mapping
    # key = vocabulary word, value = word vector
    for w in model_cbow.wv.vocab.keys():
        embeddings_index[w] = model_cbow.wv[w]

    # List containing the number of words in each case
    length = []
    for x in text_column:
        length.append(len(x.split()))

    # Word-to-vector mapping from dictionary to matrix
    tokenizer = build_tokenizer(text_column, NUM_WORDS)
    embedding_matrix = np.zeros((NUM_WORDS, 100))
    for word, i in tokenizer.word_index.items():
        if i >= NUM_WORDS:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

embedding_matrix = create_embedding_matrix(model_cbow,
                                           df['text'], NUM_WORDS)
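The `build_tokenizer` helper is not shown in the article; in Keras terms it would wrap `Tokenizer(num_words=NUM_WORDS).fit_on_texts(...)`. A dependency-free sketch of the `word_index` such a tokenizer produces (most frequent words get the lowest indices, starting at 1) — the class name and details are assumptions:

```python
from collections import Counter

class SimpleTokenizer:
    """Minimal stand-in for Keras' Tokenizer: frequency-ranked word_index."""
    def __init__(self, texts, num_words):
        counts = Counter(w for t in texts for w in t.split())
        # Index 0 is reserved for padding, so ranks start at 1
        self.word_index = {w: i + 1 for i, (w, _) in
                           enumerate(counts.most_common())}
        self.num_words = num_words

def build_tokenizer(text_column, NUM_WORDS):
    return SimpleTokenizer(text_column, NUM_WORDS)

tok = build_tokenizer(["rummy tournament", "rummy deposit"], NUM_WORDS=3)
print(tok.word_index)  # {'rummy': 1, 'tournament': 2, 'deposit': 3}
```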
from keras.models import Sequential, Model
from keras.layers import Dense, Conv1D, GlobalAveragePooling1D
from keras.layers import Embedding, concatenate

# NUM_WORDS: number of words retained in corpus
# EMBEDDING_DIM: dimensions of embedding matrix
# MAXLEN: length for a sequence:
#   if length > MAXLEN, sequence is trimmed
#   if length < MAXLEN, sequence is padded with 0s

# Define MLP network
def create_mlp():
    model = Sequential()
    model.add(Dense(100,  # hidden size assumed where the original was truncated
                    input_dim=X_train_mlp.shape[1], activation="relu"))
    model.add(Dense(50, activation="relu"))
    return model

# Define CNN network
def create_cnn():
    model = Sequential()
    model.add(Embedding(NUM_WORDS, EMBEDDING_DIM,
                        input_length=MAXLEN, trainable=True))
    model.add(Conv1D(filters=128, kernel_size=2,
                     padding='valid', activation='relu', strides=1))
    # Pool over the sequence dimension so the CNN output is 2-D
    model.add(GlobalAveragePooling1D())
    model.add(Dense(100, activation='relu'))
    return model

mlp = create_mlp()
cnn = create_cnn()

# Combine input from MLP and CNN and connect to a new dense layer
combinedInput = concatenate([mlp.output, cnn.output])
x = Dense(50, activation="relu")(combinedInput)
x = Dense(Y_train_enc.shape[1], activation='softmax')(x)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit([np.array(X_train_mlp), X_train_cnn_seq], Y_train_enc,
          validation_data=([np.array(X_val_mlp), X_val_cnn_seq],
                           Y_val_enc),
          epochs=10, batch_size=32)
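The MAXLEN trimming and padding described in the comments above is what Keras' `pad_sequences` does; a dependency-free sketch (note that Keras defaults to padding and truncating on the 'pre' side, which is what this sketch mimics for padding but not for truncation):

```python
def pad_or_trim(seq, maxlen):
    """Trim sequences longer than maxlen; left-pad shorter ones with 0s."""
    if len(seq) >= maxlen:
        return seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq

print(pad_or_trim([5, 3, 8], 5))           # [0, 0, 5, 3, 8]
print(pad_or_trim([5, 3, 8, 2, 7, 1], 5))  # [5, 3, 8, 2, 7]
```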


Once model building was complete, the next step was deploying it on the Salesforce platform. Since Salesforce is a widely used tool, various APIs serving different purposes are available. We used the Simple Salesforce library for Python (read the documentation here). It lets you query data from the Salesforce backend and perform update operations on records.

from simple_salesforce import Salesforce

# Create Salesforce connection
sf = Salesforce(
    username=USERNAME,
    password=PASSWORD, security_token=TOKEN)

# Query data
sf_data = sf.query_all(query)
df = pd.DataFrame(sf_data['records']).drop(columns='attributes')

# Get model predictions in the desired output format
output_dicts = get_model_predictions()

# Update Salesforce data tables
# (object name 'Case' assumed; bulk update takes a list of dicts)
sf.bulk.Case.update(output_dicts)
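Bulk updates in simple_salesforce take a list of dicts, each keyed by the record `Id` plus the fields to set. A sketch of what `get_model_predictions()` might assemble; the helper and the custom field names (`Category__c`, `Subcategory__c`) are hypothetical:

```python
def to_update_payload(predictions):
    """Convert (case_id, category, subcategory) tuples into the
    list-of-dicts format expected by simple_salesforce bulk updates."""
    return [{'Id': case_id,
             'Category__c': cat,
             'Subcategory__c': subcat}
            for case_id, cat, subcat in predictions]

payload = to_update_payload([('5001x000001abCD', 'Payments', 'Deposit Failed')])
print(payload[0]['Category__c'])  # Payments
```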

  • The accuracy after deployment was similar to the testing and validation accuracies: 82% at the category level and 74% at the subcategory level.
  • The weighted precision and recall for the trained model were 74% and 76%, respectively. This is a decent result for a textual dataset with more than 40 labels and emails that can include vernacular language.
  • For datasets with a lower number of labels, techniques like Correlated topic modeling, Latent Dirichlet Allocation can give decent performance as well.
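For reference, "weighted" precision and recall average the per-label scores weighted by label support, as in sklearn's `average='weighted'` option. A small pure-Python sketch of weighted recall, with made-up labels:

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-label recall, averaged with weights equal to label support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred)
                   if t == label and p == label)
        score += (n / total) * (hits / n)
    return score

y_true = ['bonus', 'bonus', 'kyc', 'kyc', 'kyc']
y_pred = ['bonus', 'kyc',   'kyc', 'kyc', 'bonus']
print(round(weighted_recall(y_true, y_pred), 2))  # 0.6
```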



Junglee Games

Junglee Games develops cutting-edge gaming technology and customized licensing solutions for desktop and mobile platforms.