concatenate file contents into a list with python

Question:

I wrote a small script in Python to concatenate some lines from different files into one file. But somehow it doesn't print what I want it to print from the function I wrote. I tried to spot the problem, but after one evening and one morning I still can't find it. Could somebody help me, please? Thanks a lot!

So I have a folder with thousands of .fa files in it. From each .fa file, I would like to extract the lines starting with ">" and also make some changes to pull out the information I want. In the end, I would like to combine all the information extracted from one file into one line of a new file, and then concatenate the information from all the .fa files into one .txt file.

So the folder:

% ls
EstimatedSpeciesTree.nwk    HOG9998.fa          concatenate_gene_list_HOG.py
HOG9997.fa          HOG9999.fa          output

One .fa file for example

>BnaCnng71140D [BRANA]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF

>Cadbaweo98702.t [THATH]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF

What I would like to have is one file like this

HOG9997.fa BnaCnng71140D:BRANA Cadbaweo98702.t:THATH
HOG9998.fa Bkjfnglks098:BSFFE dsgdrgg09769.t
HOG9999.fa Dsdfdfgs1937:XDGBG Cadbaweo23425.t:THATH Dkkjewe098.t:THUGB
# NOTE: the number of lines in each .fa file is uncertain. Also, some lines have [ ], but some do not.

So my code is

#!/usr/bin/env python3

import os
import re
import csv


def concat_hogs(a_file):
    output = []
    for row in a_file:  # select the gene names in each HOG fasta file
        if row.startswith(">"):
            trans = row.partition(" ")[0].partition(">")[2]
            if row.partition(" ")[2].startswith("["):
                species = re.search(r"\[(.*?)\]", row).group(1)
            else:
                species = ""
            output.append(trans + ":" + species)

    return '\t'.join(output)



folder = "/tmp/Fasta/"
print("Concatenate names in " + folder)

for file in os.listdir(folder):
    if file.endswith('.fa'):
        reader = csv.reader(file, delimiter="\t")
        print(file + concat_hogs(reader))

But the output only prints the file name without the part that should be generated by the function concat_hogs(file). I don't understand why.

Asked By: zzz


Answers:

Use standard Python libraries

In this case

  • regex (use a site such as regex101 to test your regex)
  • pathlib to encapsulate paths in a platform independent way
  • collections.namedtuple to make data more structured

A breakdown of the regex used here:
>([a-z0-9A-Z.]+?)\s*(\n|\[([A-Z]+)\]?\n)

  • > The start of block character
  • (regex1) First matching block
  • \s* Any amount of whitespace (i.e. zero spaces is ok)
  • (regex2|regex3) A choice of two possible regexes
  • regex1: [a-z0-9A-Z.]+ = One or more of the characters in the class, where the class is any letter a to z or A to Z, any digit 0 to 9, or a dot
  • regex2: \n = A newline that immediately follows the whitespace
  • regex3: \[([A-Z]+)\] = One or more upper case letters inside square brackets

Note1: The brackets create capture groups, which we later use to split out the fields.

Note2: The regex allows zero or more whitespace characters between the first and second part of the text line, which makes it more resilient.
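As a quick sanity check, here is what the pattern captures when run against the example header lines from the question (a small sketch, not part of the final script; the sequence line is shortened):

import re

regex = re.compile(r">([a-z0-9A-Z.]+?)\s*(\n|\[([A-Z]+)\]?\n)")

sample = (
    ">BnaCnng71140D [BRANA]\n"
    "MTSSFKLSDLEEVTTNAEK\n"      # sequence line, not matched by the pattern
    ">Cadbaweo98702.t [THATH]\n"
    "GGPITKFLRRYSF\n"
)

for match in regex.findall(sample):
    # group 1 is the gene name, the last group is the species tag (empty string if absent)
    print(match[0], match[-1])
# prints:
# BnaCnng71140D BRANA
# Cadbaweo98702.t THATH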

import re
from collections import namedtuple
from pathlib import Path
import os

class HOG(namedtuple('HOG', ['filepath', 'filename', 'key', 'text'], defaults=[None])):
    __slots__ = ()
    def __str__(self):
        return f"{self.key}:{self.text}"
    
regex = re.compile(r">([a-z0-9A-Z.]+?)\s*(\n|\[([A-Z]+)\]?\n)")

path = Path(os.path.abspath("."))
wildcard = "*.fa"
files = list(path.glob("*.fa"))

print(f"Searching {path}/{wildcard} => found {len(files)} files")

data = {}

for file in files:
    print(f"Processing {file}")
    with file.open() as hf:
        text = hf.read(-1)
        
    matches = regex.findall(text)
    
    for match in matches:
        key = match[0].strip()
        text = match[-1].strip()
        if file.name not in data:
            data[file.name] = []
            
        data[file.name].append(HOG(path, file.name, key, text))

print("Now you have the data you can process it as you like")
for file, entries in data.items():
    line = "\t".join(list(str(e) for e in entries))
    print(file, line)

# e.g. Write the output as desired
with Path("output").open("w") as fh:
    for file, entries in data.items():
        line = "\t".join(list(str(e) for e in entries))
        fh.write(f"{file}\t{line}\n")
Answered By: Jay M

The error comes from passing the name of the file to your concat_hogs function instead of an open, iterable file handle. You are missing the actual opening of the file for reading.
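For illustration, a minimal sketch of just that fix, keeping concat_hogs as written in the question and dropping csv.reader entirely, since iterating an open file handle already yields one line at a time:

import os

folder = "/tmp/Fasta/"
print("Concatenate names in " + folder)

for file in os.listdir(folder):
    if file.endswith('.fa'):
        # os.listdir returns bare file names, so build the full path and open it;
        # the open handle is an iterable of lines, which is what concat_hogs expects
        with open(os.path.join(folder, file)) as handle:
            # a tab separates the file name from the gene list (the original print had no separator)
            print(file + "\t" + concat_hogs(handle))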


I agree with Jay M that your code can be simplified drastically, not least by using regular expressions more efficiently. Also pathlib is awesome.

But I think it can be even more concise and expressive. Here is my suggestion:

#!/usr/bin/env python3
import re
from pathlib import Path


GENE_PATTERN = re.compile(
    r"^>(?P<trans>[w.]+)s+(?:[(?P<species>w+)])?"
)


def extract_gene(string: str) -> str:
    match = re.search(GENE_PATTERN, string)
    return ":".join(match.groups(default=""))


def concat_hogs(file_path: Path) -> str:
    with file_path.open("r") as file:
        return '\t'.join(
            extract_gene(row)
            for row in file
            if row.startswith(">")
        )


def main() -> None:
    folder = Path("/tmp/Fasta/")
    print("Concatenate names in", folder)
    for element in folder.iterdir():
        if element.is_file() and element.suffix == ".fa":
            print(element.name, concat_hogs(element))


if __name__ == '__main__':
    main()

I am using named capturing groups for the regular expression because I prefer them for readability and usability later on.

Also, I assume that the first group can only contain letters, digits, and dots. Adjust the pattern if there are more options.
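For example, a quick sketch of how the named groups can be read back by name (using the header line from the question):

match = GENE_PATTERN.search(">BnaCnng71140D [BRANA]")
if match is not None:
    # access the captured groups by name instead of by index
    print(match.group("trans"))    # BnaCnng71140D
    print(match.group("species"))  # BRANA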

PS

Just to add a few additional explanations:

The pathlib module is a great tool for any basic filesystem-related task. Among a few other useful methods you can look up there, I use the Path.iterdir method, which just iterates over the elements in that directory instead of creating an entire list of them in memory first, the way os.listdir does.
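For example (a tiny sketch; the folder is the placeholder path from the question):

import os
from pathlib import Path

folder = Path("/tmp/Fasta/")

# os.listdir builds a full list of plain file name strings up front
names = os.listdir(folder)

# Path.iterdir lazily yields Path objects, which can be opened or inspected directly
fasta_files = [p for p in folder.iterdir() if p.suffix == ".fa"]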

The RegEx Match.groups method returns a tuple of the matched groups; the default parameter allows setting a value for groups that did not match. I put an empty string there, so that I can always simply str.join the groups, even if the species group was not found. Note that this .groups call will result in an AttributeError if no match was found at all, because then the match variable will be None. It may or may not be useful for you to catch this error.
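A small demonstration of that behaviour with the pattern defined above (the second header is a hypothetical variant without a species tag, the third line is not a header at all):

with_species = GENE_PATTERN.search(">BnaCnng71140D [BRANA]\n")
print(with_species.groups(default=""))     # ('BnaCnng71140D', 'BRANA')

without_species = GENE_PATTERN.search(">Cadbaweo98702.t\n")
print(without_species.groups(default=""))  # ('Cadbaweo98702.t', '')

no_match = GENE_PATTERN.search("GGPITKFLRRYSF\n")
print(no_match)                            # None: calling .groups() here would raise AttributeError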

For a few additional pointers about using regular expressions in Python, there is a great How-To-Guide in the docs. In addition I can only agree with Jay M about how useful regex101.com is, regardless of language specifics. Also, I think I would recommend using his approach of reading the entire file into memory as a single string first and then using re.findall on it to grab all matches at once. That is probably much more efficient than going line-by-line, unless you are dealing with gigantic files.
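If you go that route, a findall-based variant of concat_hogs could look roughly like this (a sketch, with a hypothetically named MULTILINE_GENE_PATTERN; note the added re.MULTILINE flag, which is needed so that ^ anchors at the start of every header line rather than only at the start of the file):

import re
from pathlib import Path

MULTILINE_GENE_PATTERN = re.compile(
    r"^>(?P<trans>[\w.]+)\s+(?:\[(?P<species>\w+)\])?",
    re.MULTILINE,
)


def concat_hogs(file_path: Path) -> str:
    # read the whole file once and collect all headers in a single pass
    text = file_path.read_text()
    return '\t'.join(
        f"{trans}:{species}"
        for trans, species in MULTILINE_GENE_PATTERN.findall(text)
    )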

In concat_hogs I pass a generator expression to str.join. This is more efficient than first creating a list and passing that to join because no additional memory needs to be allocated. This is possible because str.join accepts any iterable of strings and that generator expression (... for ... in ...) returns a Generator, which inherits from Iterator and thus from Iterable. For more insight about the container inheritance structures I always refer to the collections.abc docs.

Answered By: Daniil Fajnberg