text-extraction

Python BeautifulSoup issue in extracting direct text in a given html tag

Question: I am trying to extract the direct text in a given HTML tag. For example, for <p> Hello! </p>, the direct text is Hello!. The code works well except in the case below. from bs4 import BeautifulSoup soup = BeautifulSoup('<div> <i> </i> FF Services …

Total answers: 2

Python Pandas Extract text between a word and a symbol

Question: I am trying to extract the text between a word and a symbol. Here is the input table. And my expected output is like this. I do not want the word 'Team:' or '<>' in the output. I tried something like this but …

Total answers: 3

extract the domain name from the urls in another list

Question: Extract the domain name from the URLs in another list. You also need to extract the ending string that each URL ends with. For example, for https://www.example.com/market.php the domain name is www.example.com and the ending string is php. Extract the domains and …

Total answers: 1
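
The domain and ending-string extraction described in the question above can be sketched with the standard library alone. This is a minimal sketch, not the accepted answer: the helper name domain_and_ending and the urls list are hypothetical, and "ending string" is assumed to mean the file extension of the URL path.

```python
from urllib.parse import urlparse
import posixpath


def domain_and_ending(url):
    """Return (domain, ending) for a URL; hypothetical helper for illustration."""
    parsed = urlparse(url)
    # netloc is the host part of the URL, e.g. "www.example.com".
    domain = parsed.netloc
    # splitext("/market.php") gives ("/market", ".php"); drop the leading dot.
    ending = posixpath.splitext(parsed.path)[1].lstrip(".")
    return domain, ending


# Hypothetical input list, using the example URL from the question.
urls = ["https://www.example.com/market.php"]
results = [domain_and_ending(u) for u in urls]
print(results)  # [('www.example.com', 'php')]
```

For URLs that carry a port or credentials, netloc includes them, so parsed.hostname may be the better choice in that case.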

extracting all tables using tabula

Question: When reading a PDF file, df = tabula.read_pdf(pdf_file, pages='all') displays all tables from all pages, but converting into a pandas DataFrame using tables = pd.DataFrame(pdf_file, pages = 'all', lattice = 'True')[0]) displays only the table on the first page. Asked By: arvin || Source Answers: The …

Total answers: 1

Is there a way in python to extract only the CORE TEXT (without boxes, footer etc.) from a pdf?

Question: I am trying to extract only the core text from a "rich" PDF document, meaning that it has a lot of tables, graphs, boxes, footers etc. in which I am not interested. I tried …

Total answers: 2

How to match text in two different file and extract values

Question: I have two files: a YAML file that maps Tibetan words to their meanings, and a CSV file that contains only each word and its POS tag, as below. yaml file : ད་གདོད: ད་གཟོད་དང་དོན་འདྲ། ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན། ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན། ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་། ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས། csv file : …

Total answers: 3

Extract all phrases from a pandas dataframe based on multiple words in list

Question: I have a list, L: L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side'] I have a pandas DataFrame, DF: Text the objects are both before and after the person the object is behind the person the object in right …

Total answers: 3

Extract Numeric info from Pandas column using regex

Question: I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column: Text Dimensions: 23"/60 Dimensions: 23" / 60 Dimensions: 48" Dimensions: 22.5X8.25 Dimensions: 80IN Dimensions: 567 S Dimensions: 22.5X8.25 Dimensions: 26INNP Dimensions: 24" x 55" with pipe 16 x 7 I am using …

Total answers: 1

extract strings from HTML tag pandas

Question: How do I extract the following strings from the tags below using str.extract, a regex, or any other efficient way in pandas? <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> <a href="http://vine.co" rel="nofollow">Vine – Make a Scene</a> <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a> I am using: .str.extract('(>[A-Za-z])<') I want this …

Total answers: 1
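
The pattern quoted in the question above, (>[A-Za-z])<, only matches a single letter sitting directly between > and <, and its capture group includes the >. Below is a minimal sketch of one possible alternative, assuming the desired output is the visible link text (the question is cut off before it says so). It uses the standard re module rather than pandas; the names ANCHOR_TEXT, anchor_text, and sources are hypothetical.

```python
import re

# Capture everything between the first ">" and the next "<".
ANCHOR_TEXT = re.compile(r">([^<]+)<")


def anchor_text(tag):
    """Return the text inside a single <a>...</a> tag, or None if absent."""
    match = ANCHOR_TEXT.search(tag)
    return match.group(1) if match else None


# Two of the sample tags from the question.
sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine – Make a Scene</a>',
]
print([anchor_text(s) for s in sources])
# ['Twitter for iPhone', 'Vine – Make a Scene']
```

Because the pattern has exactly one capture group, the same pattern string should also be usable with pandas Series.str.extract on the column that holds the tags. For anything beyond simple, well-formed tags like these, an HTML parser is likely a safer choice than a regex.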

what is fastest way to convert pdf to jpg image?

Question: I am trying to convert multiple PDFs (10k+) to JPG images and extract text from them. I am currently using the pdf2image Python library, but it is rather slow. Is there a faster library? from pdf2image import convert_from_bytes images = convert_from_bytes(open(path,"rb").read()) …

Total answers: 3