A dosage file (used in computational genetics) is formatted like this:
// ID TAG geno1 geno2 geno3 ... 1->76016 DOSE 1.871 1.832 1.897 1.995 1.884 1.856 1.405 1.853 1.998 1.885 1.390 1.971 1.890 1.699 2->76018 DOSE 1.877 1.832 1.897 1.995 1.884 1.856 1.405 1.853 1.998 1.885 1.390 1.971 1.890 1.699 3->76063 DOSE 1.877 1.832 1.893 1.995 1.884 1.856 1.405 1.853 1.998 1.885 1.390 1.971 1.890 1.699 4->76015 DOSE 1.877 1.832 1.897 1.994 1.884 1.856 1.405 1.853 1.998 1.885 1.390 1.971 1.890 1.699 5->76023 DOSE 1.877 1.832 1.897 1.995 1.885 1.856 1.405 1.853 1.998 1.885 1.390 1.971 1.890 1.699 6->76030 DOSE 1.877 1.832 1.897 1.995 1.884 1.856 1.405 1.853 1.998 1.885 1.390 1.971 1.890 1.699 7->76044 DOSE 1.877 1.832 1.897 1.995 1.884 1.856 1.407 1.853 1.998 1.885 1.390 1.971 1.890 1.699 8->76008 DOSE 1.877 1.832 1.897 1.995 1.884 1.856 1.405 1.858 1.998 1.885 1.390 1.971 1.890 1.699 9->76014 DOSE 1.877 1.832 1.897 1.995 1.884 1.856 1.405 1.853 1.999 1.885 1.390 1.971 1.890 1.699 10->76011 DOSE 1.877 1.832 1.897 1.995 1.884 1.856 1.405 1.853 1.998 1.880 1.390 1.971 1.890 1.699
Here is a class for reading this kind of file:
Dosage
import java.io.*;
import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
/**
* Created by vu.co.kaiyin.ReadMachDosage on 15/11/14.
*/
public class Dosage {
final String filename;
final Pattern splitPattern;
final int nColFile;
final int nCol;
final int nRow;
public Dosage(String filename) throws Exception {
this.filename = filename;
splitPattern = Pattern.compile("\\s+");
int[] dimArray = dimensions();
nColFile = dimArray[0];
nCol = dimArray[1];
nRow = dimArray[2];
}
private int[] dimensions() throws Exception {
int nColFileCounter = 0;
int nRowCounter = 0;
try(
InputStream fileStream = new FileInputStream(filename);
InputStream gzipStream = new GZIPInputStream(fileStream);
Reader reader = new InputStreamReader(gzipStream, "US-ASCII");
BufferedReader bufferedReader = new BufferedReader(reader);
) {
String line;
while ((line = bufferedReader.readLine()) != null) {
if (nRowCounter == 0) {
nColFileCounter = splitPattern.split(line).length;
}
nRowCounter++;
}
return new int[]{nColFileCounter, nColFileCounter - 2, nRowCounter};
}
}
public double[][] readCols(int start, int end) throws Exception {
// Data indices: x x 1 2 3
// Indices: 0 1 2 3 4
// Columns: 1->241 DOSE 1.1783 0.3498 1.9834 ...
if(start < 1) start = 1;
if(end > nCol) end = nCol;
start++;
end += 2; // java end index is exclusive
try(
InputStream fileStream = new FileInputStream(filename);
InputStream gzipStream = new GZIPInputStream(fileStream);
Reader reader = new InputStreamReader(gzipStream, "US-ASCII");
BufferedReader bufferedReader = new BufferedReader(reader);
) {
String line;
String[] fields;
ArrayList<double[]> dosageList = new ArrayList<>();
while ((line = bufferedReader.readLine()) != null) {
fields = splitPattern.split(line);
double[] dosage = new double[end - start];
for (int i = 2; i < end; i++) {
dosage[i - 2] = Double.parseDouble(fields[i]);
}
dosageList.add(dosage);
}
double[][] dosageMatrix = new double[dosageList.size()][];
dosageMatrix = (double[][]) dosageList.toArray(dosageMatrix);
return dosageMatrix;
}
}
public double[][] readAll() throws Exception {
return readCols(1, nCol);
}
}
TestDosage
public class TestDosage {
public static void main(String[] args) throws Exception {
String filename = "/tmp/chr22.dose.gz";
Dosage dosage = new Dosage(filename);
double[][] doseMatrix = dosage.readCols(1, 3);
PrintArray.print(doseMatrix);
double[][] doseMatrixFull = dosage.readAll();
PrintArray.print(doseMatrixFull);
}
}
-
\$\begingroup\$ Thanks for not modifying the question after posting. Since the original question had no answers, and your second question didn't point out specific modifications, I've merged your two questions. \$\endgroup\$200_success– 200_success2014年11月16日 13:03:23 +00:00Commented Nov 16, 2014 at 13:03
-
\$\begingroup\$ so what is you question? \$\endgroup\$devops– devops2014年11月16日 15:47:30 +00:00Commented Nov 16, 2014 at 15:47
-
\$\begingroup\$ It's for code review, I imagine it's up to the reviewer to raise suggestions for improvement and no question is needed from OP's part? \$\endgroup\$qed– qed2014年11月16日 15:49:39 +00:00Commented Nov 16, 2014 at 15:49
1 Answer 1
It's a very bad practice when a method is declared with throws Exception
. The purpose of throwing exceptions is to signal the caller that something went wrong, and give it a chance to recover gracefully. The more generic the exception,
the less it helps the caller to recover.
The code opens and unzips the file twice, in different methods. This is bad for many reasons:
- Waste of processing
- The code to open the file (input stream, buffered input stream, unzipping) is duplicated, which goes against the DRY (don't repeat yourself) principle
The collaborating classes are not well designed:
Dosage
: read dosage from file, having properties likefilename
,splitPattern
PrintArray
: utility class with aprint
method to print a matrix
This is incoherent design, with not much logic between the elements. I suggest to rethink the classes and their responsibilities and properties. How about something like this:
DosageMatrix
: a simple object to contain a dosage matrix.- May have a
print
method to print the matrix in a nice format - May have a
createFromZipFile
factory method that can process a zip file and create aDosageMatrix
instance. The method could takestartCol, endCol
parameters
- May have a
When printing a double[][] matrix
,
you can use a for-each instead of old fashioned for (;;)
.
Instead of this:
for(int i=0; i<nrow; i++) { for(int j=0; j<ncol; j++) { sb.append(String.format("%f\t", matrix[i][j])); }
Something like this, without the tedious i, j
loop index variables:
for (double[] row : matrix) {
for(double value : row) {
sb.append(String.format("%f\t", value));
}
I don't know how important is the \t
separator to you.
If you don't mind using ,
instead,
then you could further simplify by using Arrays.toString
and eliminate a nested loop:
for (double[] row : matrix) {
sb.append(Arrays.toString(row));
-
\$\begingroup\$ Printing is just an auxiliary thing for testing, will not be included in the final package. I have made a version that reads the file only once and will post it in a second. \$\endgroup\$qed– qed2014年11月16日 20:27:52 +00:00Commented Nov 16, 2014 at 20:27
-
\$\begingroup\$ Thanks for the review. A new version is here: codereview.stackexchange.com/questions/70026/… \$\endgroup\$qed– qed2014年11月16日 20:33:02 +00:00Commented Nov 16, 2014 at 20:33