I have 1,000 files; the start of each file looks like this:
!dataset_description = Analysis of POF D119 mutation.
!dataset_type = Expression profiling by array
!dataset_pubmed_id = 17318176
!dataset_platform = GPL1322
The aim: I want to transform this information into a list so that I can build an Excel spreadsheet covering all the files; i.e. I want the list to look like this:
[Analysis_of_POF_D119_mutation,Expression_profiling_by_array,17318176,GPL1322]
I have this code (it only extracts the first variable, "!dataset_description"; I would subsequently run it on each variable of interest, i.e. !dataset_type, !dataset_pubmed_id, !dataset_platform):
import sys

OpenDataset = open(sys.argv[1], 'r')
Dataset = OpenDataset.readlines()
ListOfInformation = []
formatted_line = lambda x: "_".join(line.strip().split("=")[x].split())
for line in Dataset:
    if line.startswith("!dataset_description"):
        description = formatted_line(1)
        print description
The code works; however, I am now at a stage where I understand Python basics, and I want to start coding more "pythonically". I have two questions.
- It seems silly to use the lambda expression that I am using. "x" in the lambda expression will always be 1, since I will always want what comes after the "=" sign. Therefore x isn't really a "variable", but then I can't have a lambda expression without a variable.
I tried to change the variable to being what the line starts with, which is the true variable, doing something like this:
formatted_line = lambda x: "_".join(line.strip().split("=")[1].split()) if line.startswith(x)
However, this code returns a syntax error.
Would someone know how to make the above lambda expression work?
- These files have the potential to be really, really big. However, the information that I need is at the start of the file, and every line of it starts with the "!" symbol. So it seems silly to read in the whole file when I only need the first X lines, all of which start with "!" (the exact number of lines per file will vary). Is there a way to read in just the lines starting with "!", or is it quicker just to use file.readlines()?
2 Answers
You certainly can have a lambda expression without an argument.
However, in this case, you should actually pass an argument: the line itself. That is the thing that you're operating on, therefore it should be passed into the function.
Your if statement does not work because an inline if in Python must always have an else clause. In this case the value in else is the empty string.
So:
formatted_line = lambda line, x: "_".join(line.strip().split("=")[1].split()) if line.startswith(x) else ""
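The same idea can also be written with def instead of lambda. The following is only a minimal sketch: the names format_value and prefix are hypothetical (they do not appear in the question), and it assumes every matching line contains an "=" sign.

```python
def format_value(line, prefix):
    """Return the text after '=' with whitespace runs replaced by '_'.

    Returns an empty string when the line does not start with prefix.
    """
    if not line.startswith(prefix):
        return ""
    # Split only on the first '=' so values that contain '=' stay intact.
    value = line.strip().split("=", 1)[1]
    return "_".join(value.split())


line = "!dataset_description = Analysis of POF D119 mutation.\n"
print(format_value(line, "!dataset_description"))  # Analysis_of_POF_D119_mutation.
```

Note that the trailing period of the description is kept, exactly as it would be with the lambda version.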
If you only want to read values until the lines stop starting with !, you can use itertools.takewhile:
from itertools import takewhile
...
for line in takewhile(lambda line: line.startswith("!"), Dataset):
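To avoid readlines() altogether, takewhile can be applied to the file object itself, which is read incrementally as it is iterated. The sketch below is one possible way to put this together, not a definitive implementation: read_header and WANTED are hypothetical names, and it assumes (as the question states) that all "!" lines sit together at the very top of each file.

```python
from itertools import takewhile

# Keys of interest, in the column order wanted for the spreadsheet.
WANTED = ("!dataset_description", "!dataset_type",
          "!dataset_pubmed_id", "!dataset_platform")


def read_header(path, wanted=WANTED):
    """Collect the values of the wanted '!' keys from the top of one file."""
    found = {}
    with open(path) as handle:
        # The file is consumed line by line, and takewhile stops at the
        # first line that does not start with '!'.
        for line in takewhile(lambda l: l.startswith("!"), handle):
            key, sep, value = line.partition("=")
            key = key.strip()
            if sep and key in wanted:
                found[key] = "_".join(value.split())
    # Missing keys become empty strings so every row has the same columns.
    return [found.get(key, "") for key in wanted]
```

Calling read_header(sys.argv[1]) would then return one row per file, which could be written out with the csv module.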
Comments
It raises SyntaxError because you're missing an else branch. The "expression if" or "inline if" has the syntax: <value to return when True> if <condition> else <value when False>. You can't use elif.
So the code might look like this:
formatted_line = lambda x: "_".join(line.strip().split("=")[1].split()) if line.startswith(x) else "" # You can replace this with `None`.
... 1 always? Pass the line instead.
... lambda version, what will be the result of the expression if the line doesn't start with x? That is why it produces a syntax error.
... lambda instead of def for a named function is generally considered bad style in Python, although that rule is sometimes bent, e.g. when creating a key function that's used as an arg to sort or sorted and then immediately re-used as an arg to itertools.groupby. Apart from brevity, lambdas have no advantage over full function definitions, but they have several disadvantages. So you should only use them when a simple anonymous function is appropriate.
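The exception mentioned in the last comment, a short key function reused by sorted and itertools.groupby, might look like the sketch below. The sample lines and the name key_of are made up for illustration.

```python
from itertools import groupby

lines = [
    "!dataset_type = Expression profiling by array",
    "!dataset_platform = GPL1322",
    "!dataset_type = Expression profiling by array",
]

# A short key function bound to a name so it can be reused below.
key_of = lambda line: line.split("=")[0].strip()

# groupby only groups adjacent items, so sort with the same key first.
counts = [(key, len(list(group)))
          for key, group in groupby(sorted(lines, key=key_of), key=key_of)]
print(counts)
```

For anything longer than a one-line key like this, a def with a docstring is usually easier to read and debug.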