text-mining

How do I extract the text of a single page with PyPDF2?

Question: I have a document library consisting of several hundred PDF documents. I am attempting to export the first page of each PDF document. Below is my script, which extracts the page and saves it as an individual PDF. However, the …

Total answers: 2

Is there a way in python to extract only the CORE TEXT (without boxes, footer etc.) from a pdf?

Question: I am trying to extract only the core text from a "rich" PDF document, meaning that it has a lot of tables, graphs, boxes, footers, etc. in which I am not interested. I tried …

Total answers: 2

Python regex – Extract all the matching text between two patterns

Question: I want to extract all the text in the bullet points numbered 1.1, 1.2, 1.3, etc. Sometimes the bullet numbers contain spaces, like 1. 1, 1. 2, 1 .3, 1 . 4. Sample text: text = "some text before pattern 1.1 …

Total answers: 1

Remove extra words from text

Question: I've been trying to remove extra words like {'by', 'the', 'and', 'of', 'a'} from text, and my best attempt so far is this. Code: def clean_text(text): """ takes the text and removes signs and some words """ stopwords = {'by', 'the', 'and', 'of', 'a'} result = [word for word in re.split(r"\W+", text) …

Total answers: 2

Pandas find multiple words from a list and assign Boolean value if found

Question: I have a dataframe like this: data = { "properties": ["FinancialOffice","Gas Station", "Office", "K-12 School", "Commercial, Office"], } df = pd.DataFrame(data). This is my list: proplist = ["Office","Other – Mall","Gym"]. What I am trying to do is using the list I …

Total answers: 3

How do I add the result of a print function into a list

Question: I have the following def, which ends with a print function: from nltk.corpus import words nltk.download('words') correct_spellings = words.words() from nltk.metrics.distance import jaccard_distance from nltk.util import ngrams from nltk.metrics.distance import edit_distance def answer_nine(entries=['cormulent', 'incendenece', 'validrate']): for entry in entries: temp = …

Total answers: 2

Extracting dates that are in different formats using regex and sorting them – pandas

Question: I am new to text mining and I need to extract the dates from a *.txt file and sort them. The dates are embedded in the sentences (each line), and their format can be any of the following: 04/20/2009; 04/20/09; …

Total answers: 1

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

Question: I am trying to run this code: pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression()) param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]} grid = GridSearchCV(pipe, param_grid, cv=5) grid.fit(text_train, Y_train) scores = grid.cv_results_['mean_test_score'].reshape(-1, 3).T # visualize heat map heatmap = mglearn.tools.heatmap( scores, xlabel="C", ylabel="ngram_range", …

Total answers: 4

HashingVectorizer and multinomial naive Bayes are not working together

Question: I am trying to write a Twitter sentiment analysis program with scikit-learn in Python 2.7. The OS is Linux Ubuntu 14.04. In the vectorizing step, I want to use HashingVectorizer(). When I test classifier accuracy, it works fine with the LinearSVC, NuSVC, GaussianNB, BernoulliNB and LogisticRegression classifiers, but …

Total answers: 3

How to find the closest word to a vector using word2vec

Question: I have just started using word2vec and I was wondering how we can find the closest word to a given vector. Suppose I have this vector, which is the average vector for a set of vectors: array([-0.00449447, -0.00310097, 0.02421786, …], dtype=float32) Is there a …

Total answers: 3