Jose Luis Silva, Ph.D.

  • Ph.D. in Physics
    UU 🇸🇪
  • Postdoc in AI:
    LiU 🇸🇪
  • Project:
    Aicavity Academy 🇸🇪
  • Nationality:
    Sweden/Brazil 🇸🇪🇧🇷
  • Quants, Management, Data Science & Analytics
  • Materials, AI, ML & Engineering
  • LLMs, Graphs, Vision & NLP
  • Deep Learning & Reinforcement Learning

Natural Language Processing on Financial Statements

May 2, 2022

Project Description:

NLP analysis of 10-K financial statements to generate an alpha factor. For the dataset, we’ll be using end-of-day data from Quotemedia and the Loughran-McDonald sentiment word lists.

Artificial Intelligence for Trading

– My Certificate –

10-Ks (Steps)

  • The function get_documents extracts the documents from the text.

  • Alternative implementation:

return re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL | re.IGNORECASE).findall(text)

  • The function get_document_type returns the document type lowercased.

  • Similar implementation:

return re.findall("<TYPE>(.*?)\n", doc)[0].lower()
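The two steps above can be tried together on a small sample. This is a minimal sketch, assuming each document sits between `<DOCUMENT>` and `</DOCUMENT>` tags and declares its type on a `<TYPE>` line; the sample filing text is made up for illustration and is not real filing data.

```python
import re

# Assumed tag layout; the patterns are compiled once and reused.
DOC_PATTERN = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_PATTERN = re.compile(r"<TYPE>([^\n]+)")


def get_documents(text):
    """Return the body of every <DOCUMENT>...</DOCUMENT> block in the text."""
    return DOC_PATTERN.findall(text)


def get_document_type(doc):
    """Return the lowercased value after the <TYPE> tag, or '' if there is none."""
    match = TYPE_PATTERN.search(doc)
    return match.group(1).strip().lower() if match else ""


# Hypothetical filing text, for illustration only.
sample_filing = (
    "<DOCUMENT>\n<TYPE>10-K\n<SEQUENCE>1\n<TEXT>Annual report body</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-21\n<SEQUENCE>2\n<TEXT>Subsidiaries</TEXT>\n</DOCUMENT>\n"
)

documents = get_documents(sample_filing)
doc_types = [get_document_type(doc) for doc in documents]
ten_ks = [doc for doc in documents if get_document_type(doc) == "10-k"]
```

Unlike indexing with `[0]`, the `search` call here returns an empty string instead of raising when a document has no `<TYPE>` line.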

Preprocess the Data

  • The function lemmatize_words lemmatizes verbs.
  • Similar implementation:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
return [wordnet_lemmatizer.lemmatize(word, pos=wordnet.VERB) for word in words]

Analysis on 10-Ks

  • The function get_bag_of_words generates a bag of words from documents.

  • A two-line implementation ❤️

  • Alternative implementation, using a lambda function:

return np.array([sentiment_words.apply(lambda x: int(x in doc)) for doc in docs])
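Note that `x in doc` on a string is a substring test, so the lambda version records presence (0 or 1) and would also flag "loss" inside "lossless". As a hedged alternative, the sketch below counts whole tokens with the standard library only. The tokenizer, the word list, and the sample sentences are assumptions made for illustration, not the project's preprocessing.

```python
import re
from collections import Counter


def get_bag_of_words(sentiment_words, docs):
    """Count how often each sentiment word occurs in each document.

    Returns one row per document and one column per sentiment word.
    """
    rows = []
    for doc in docs:
        # Simple lowercase word tokenizer (an assumption for this sketch).
        counts = Counter(re.findall(r"[a-z]+", doc.lower()))
        rows.append([counts[word] for word in sentiment_words])
    return rows


sentiment_words = ["loss", "risk", "gain"]  # hypothetical word list
docs = [
    "Risk of loss increased; the loss was material.",
    "A gain offset the risk.",
    "Lossless compression is unrelated.",
]
bag = get_bag_of_words(sentiment_words, docs)
```

Here the third document gets an all-zero row, because "lossless" is a different token from "loss".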
  • The function get_jaccard_similarity calculates the Jaccard similarities for neighboring documents.

  • Alternative Implementation:

# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score replaces it.
return [jaccard_score(bag_of_words_matrix[i] > 0, bag_of_words_matrix[i + 1] > 0)
        for i in range(len(bag_of_words_matrix) - 1)]
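To make the quantity concrete: for two documents, the Jaccard similarity is the number of sentiment words present in both divided by the number present in either. Below is a minimal pure-Python sketch of that definition, not the scikit-learn implementation. The helper name `jaccard`, the example matrix, and the choice to return 0.0 when neither document contains any word are assumptions.

```python
def jaccard(row_a, row_b):
    """Jaccard similarity of the sets of words present in two count rows."""
    present_a = {i for i, count in enumerate(row_a) if count > 0}
    present_b = {i for i, count in enumerate(row_b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # assumed convention when both rows are empty
    return len(present_a & present_b) / len(union)


def get_jaccard_similarity(bag_of_words_matrix):
    """Similarity of each document with the next one in the sequence."""
    return [jaccard(a, b)
            for a, b in zip(bag_of_words_matrix, bag_of_words_matrix[1:])]


# Hypothetical counts: 3 documents x 3 sentiment words.
matrix = [[2, 1, 0], [0, 1, 1], [0, 0, 1]]
similarities = get_jaccard_similarity(matrix)
```

The first pair shares one word out of three present in either document, and the second pair shares one out of two.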
  • The function tfidf generates TF-IDF vectors for each document.

  • The function get_cosine_similarity calculates the cosine similarities for each pair of neighboring TF-IDF vectors/documents.

  • Alternative Implementation:

return [float(cosine_similarity(tfidf_matrix[i].reshape(1, -1), tfidf_matrix[i + 1].reshape(1, -1))[0, 0])
        for i in range(tfidf_matrix.shape[0] - 1)]
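The TF-IDF and cosine steps can also be followed end to end without any library. The sketch below is a pure-Python illustration under stated assumptions: it weights raw counts by a common smoothed IDF, ln((1 + n) / (1 + df)) + 1, which may differ from the vectorizer settings used in the project, and the helper names `tfidf_from_counts` and `cosine` are hypothetical.

```python
import math


def tfidf_from_counts(bag_rows):
    """Weight raw term counts by a smoothed inverse document frequency."""
    n_docs = len(bag_rows)
    n_terms = len(bag_rows[0])
    # Document frequency: in how many documents each term appears.
    doc_freq = [sum(1 for row in bag_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1.0 for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in bag_rows]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def get_cosine_similarity(tfidf_matrix):
    """Similarity of each TF-IDF vector with the next one in the sequence."""
    return [cosine(u, v) for u, v in zip(tfidf_matrix, tfidf_matrix[1:])]


# Hypothetical counts: documents 1 and 2 are identical, document 3 shares no words.
bag = [[2, 1, 0], [2, 1, 0], [0, 0, 3]]
vectors = tfidf_from_counts(bag)
similarities = get_cosine_similarity(vectors)
```

Identical neighbors give a similarity of about 1, and neighbors with no sentiment words in common give 0, which is why a drop in this series can flag a year in which the filing's language changed.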