Copy editing automation - Digiscape

Elsevier is a journal and research-paper publishing company. It receives journal submissions from authors, and each submission has to go through various copy-editing stages before getting published. Previously, all of these tasks were done manually; our job was to automate some of the steps to reduce processing time, errors, and manual effort.

The problem was divided into different stages:

  1. Deciding whether the research paper/journal should be approved:
    • This is based on around 100 hard-coded rules, implemented with intensive regex and Python programming (with the help of an SME); a sketch follows this list.
    • Classifying the type of journal (i.e., maths, chemistry, medical) using different ML models.
    • Classifying the type of language (British English vs. American English).
  2. Data: a set of labelled documents/research papers (thousands of articles already processed manually) in XML format. The XML structure helps us pre-process chemical names and mathematical formulas into text, either by tag name or with libraries like gensim and nltk; mathematical formulas, for example, always appear under the same XML tags.
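
For illustration, a minimal sketch of what one of these regex-based rule checks could look like; the rules and names here are hypothetical, not the actual SME-curated rules:

```python
import re

# Illustrative rules only -- the real system encodes around 100 SME-curated checks.
RULES = [
    ("placeholder text left in manuscript",
     re.compile(r"\b(lorem ipsum|TODO)\b", re.IGNORECASE)),
    ("empty abstract element",
     re.compile(r"<abstract>\s*</abstract>")),
]

def rule_violations(document_text):
    """Return the descriptions of all rules the document violates."""
    return [desc for desc, pattern in RULES if pattern.search(document_text)]

text = "<article><abstract></abstract><body>TODO: add results</body></article>"
print(rule_violations(text))
# ['placeholder text left in manuscript', 'empty abstract element']
```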

Explaining the tasks

Classifying the type of journal (i.e., maths, chemistry, medical)

  • Real-world/business objectives and constraints:
    • The cost of misclassification can be very high.
    • We want class probabilities from the classifier so that we can set our own decision threshold.
    • No strict latency concerns.
    • Interpretability is partially important.
  • Mapping the real-world problem to an ML problem
    • Type of machine learning problem: multiclass classification.
    • Performance metric (a metric sketch follows this list):
      • Multiclass log-loss (0 is perfect; the maximum is unbounded, so we minimize towards 0).
      • Confusion matrix, F1 score, and true-positive rate.
    • The distribution of data points among the output classes was imbalanced, so we apply data imputation and class weights while classifying.
    • Feature cleaning (using various NLP techniques), plus feature engineering to create additional features.
    • Interpretability.
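
To make the metric concrete, here is a minimal sketch of multiclass log-loss with scikit-learn (the class names and probabilities are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

classes = ["chemistry", "maths", "medical"]
y_true = ["maths", "chemistry", "medical"]
y_proba = np.array([
    [0.1, 0.8, 0.1],   # confident and correct -> small penalty
    [0.7, 0.2, 0.1],   # fairly confident and correct
    [0.1, 0.8, 0.1],   # confident and wrong -> large penalty
])

# Multiclass log-loss = mean over samples of -log P(true class).
print(log_loss(y_true, y_proba, labels=classes))   # ~0.96; 0 would be perfect
```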

Data: the articles are in XML format; we parse them using Beautiful Soup and extract features from them.
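
A minimal parsing sketch; the file name and tag names (title, formula, author) are assumptions, since the actual schema isn't reproduced here:

```python
from bs4 import BeautifulSoup

with open("article.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "xml")   # the "xml" parser requires lxml

# Tag names below are illustrative; the real schema has its own element names.
title = soup.find("title")
features = {
    "title": title.get_text(strip=True) if title else "",
    "formulas": [m.get_text() for m in soup.find_all("formula")],
    "authors": [a.get_text(strip=True) for a in soup.find_all("author")],
}
print(features)
```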

Some features

  • Text (pre-processed; a cleaning sketch follows this list)
    • Remove stopwords
    • Remove any tags
    • Lower-case
    • Remove new lines
    • Keep only text, numbers, and formulas (the XML pre-processing converts formulas into text)
    • Remove other special characters
    • Decontractions, e.g. "I'll" → "I will"
    • Remove person names using chunking (PERSON-type noun phrases) and merge multi-word entities via the parse tree (e.g. "new york" → "new_york")
  • Author names: style of the given name / initials only / full names
  • Address: corresponding address, institution address, private address (sometimes the institution is a strong signal of the journal type)
  • Whether images are involved, and what type of images
  • Chemical names (present or not), extracted with ChemDataExtractor
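
A minimal sketch of the text-cleaning steps above using nltk; the decontraction map is abbreviated, and the person-name chunking step is omitted for brevity:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

# Abbreviated decontraction map -- the real list covers many more forms.
CONTRACTIONS = {"i'll": "i will", "can't": "can not", "won't": "will not"}

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove any tags
    text = text.replace("\n", " ").lower()     # remove new lines, lower-case
    for short, full in CONTRACTIONS.items():   # decontractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep only text and numbers
    return " ".join(t for t in text.split() if t not in STOPWORDS)

print(clean("I'll check the <b>Results</b>\nsection against 3 formulas."))
# -> "check results section 3 formulas"
```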

 

Data processing
  • Categorical features combined with text features: TF-IDF over unigrams and bigrams, plus TF-IDF-weighted word vectors, i.e. TF-IDF weight × trained word embedding (a sketch follows).
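
A minimal sketch of the TF-IDF × word-vector feature, pairing scikit-learn's TF-IDF with trained gensim word embeddings (the two-document corpus is a toy stand-in):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["acid base titration reaction", "prime number theorem proof"]
tokenized = [d.split() for d in docs]

w2v = Word2Vec(tokenized, vector_size=50, min_count=1)   # trained word vectors
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)    # uni- and bi-grams
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(tokens):
    """Average of word vectors weighted by each word's IDF score
    (term frequency enters implicitly through repeated tokens)."""
    vecs = [w2v.wv[t] * idf.get(t, 0.0) for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(t) for t in tokenized])
print(X.shape)   # (2, 50)
```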
Training
  • Pick the hyperparameters that give the highest AUC value
  • K-fold cross-validation, grid search, random search
  • Split the data using stratified sampling
  • KPI: minimize the average log-loss (range [0, ∞)); it heavily penalises confident misclassification, since -log(p) grows without bound as the predicted probability p of the true class approaches 0
  • Multiclass models
    • Base model: multinomial Naive Bayes, log-loss 0.6 (Laplace smoothing: add alpha for unseen words)
    • KNN (doesn't work because of the curse of dimensionality, and offers no interpretability)
    • Logistic regression (works well in high dimensions): log-loss 0.37
    • Logistic regression (class-balanced): 0.34
    • Linear SVM (suits the large dimensionality, interpretable): 0.37
    • Decision tree (failed in high dimensionality): overfit
    • Random forest gave the same results as the SVM
    • Stacking [RF, LR, SVM] with LR as the meta-learner (failed because of too little data): 0.4
    • Voting classifier (RF, LR, SVM): 0.3
    • LSTM (AUC = 0.8): log-loss 0.28
    • Removing terms with very high and very low IDF values from the data also helped
*But because of interpretability, we had to go with logistic regression (a training sketch follows).
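
A minimal sketch of that training setup: stratified split, K-fold grid search, class weights, and log-loss as the KPI. The data here is synthetic; in the project, X and y come from the feature pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the real feature matrix and journal-type labels.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=10,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# Stratified split keeps the class imbalance consistent across train/test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),  # class-balanced LR
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="neg_log_loss",     # select hyperparameters by (negated) log-loss
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X_tr, y_tr)
print("CV log-loss:  ", -grid.best_score_)
print("test log-loss:", log_loss(y_te, grid.predict_proba(X_te)))
```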

Classifying the type of language (British English, American English)

  • The text is represented as N-gram features with Term Frequency-Inverse Document Frequency (TF-IDF) weighting; in the TF-IDF step, a document-frequency (DF) threshold of 2.0 is applied. Classification is carried out using a Support Vector Machine (SVM) with a linear kernel, and the best accuracy obtained is 90%. A minimal sketch follows.
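
A minimal sketch of that pipeline in scikit-learn (the training pairs are a toy stand-in, so min_df is left at 1 here to avoid filtering everything out):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in data; the real model is trained on full article text.
texts = ["the colour of the fibre was analysed",
         "the color of the fiber was analyzed"]
labels = ["british", "american"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # real setting: min_df=2
    SVC(kernel="linear"),                           # linear-kernel SVM
)
model.fit(texts, labels)
print(model.predict(["we analysed the colour spectrum"]))   # ['british']
```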
Written on September 10, 2022