Skip to content

Conversation

@avapapetti
Copy link

Ava Papetti

@avapapetti
Copy link
Author

I am having trouble using pytest-- I've uploaded my first stab at tests anyway but am not positive I am effectively testing each aspect of my code.

I have a few other questions as well:

  1. Would it be better to rename the variable names within the for loop of my recent Project instead of filtering with the same name each time?
  2. In the beginning where I am removing CNVs, I feel it is not the most sophisticated way to eliminate rows I don't need, but I felt that would be more efficient than using a for loop again-- is this OK?

@emilyyaklich
Copy link

emilyyaklich commented May 5, 2020

Hi Ava,

What exactly are you having trouble with regarding pytest? After looking at your tests, I think each unit test should test a function that you define in your main scripts to ensure it is performing as desired. So, for example, an example of a test for a function that drops the 'chrM' column from your data is below.
The function:

file1 = pd.read_csv('SL88824_20180802.cnv.csv', sep = "\t")

def cnv_drop_column(file1):
    file1_new = file1.drop(file1[file1.Chrom == 'chrM'].index)
    chroms = file1_new.Chrom.unique()
    return chroms 

and a test for this would be (located in your testing directory):

from example_main_script.py import cnv_drop_column

def test_cnv_drop_column():
    file1 = pd.read_csv('SL88824_20180802.cnv.csv', sep = "\t")
    chroms = cnv_drop_column(file1)
    assert 'chrM' not in chroms
  1. By renaming variables do you mean the "file1" and "file2" names? If your code is working and iterating through your data and giving the correct output I see no need to change the names through each iteration, but I also am a bit confused about what this question is asking.

  2. I think that the way that your are removing the rows is fine with

file1 = file1.drop(file1[file1.Chrom == 'chrM'].index)
file1 = file1[(np.abs(file1.Start - file1.Stop) >= min_cnv_length) & (file1.P_Value >= p_value_threshold)]

but I would define this as a function and apply the function to both files because you end up writing the same code twice for file1 and file2.

@avapapetti
Copy link
Author

Hi Emily,

Thank you for your reply- that cleared up a lot! So for this line in your example:

from example_main_script.py import cnv_drop_column

that goes at the beginning of each test, with 'cnv_drop_column' being the name of the function I'm testing? And is it OK to keep the tests in one file or is it better to have one file for each test?

In regards to pytest, I am able to import it into my notebook without any issues, but when I go to use it, I get "UsageError: Cell magic %%pytest not found."

@emilyyaklich
Copy link

emilyyaklich commented May 5, 2020

Yes, exactly. And the import of the functions can go at the top of your script and you can import all of the functions you will test from that file at one time. For example:

from example_main_script.py import cnv_drop_column, other_function, other function2

or you could also use:

from example_main_script.py import *

which will import all of the functions from that script.

That being said, I think it might be good to keep your tests located in one script, similar to the example that we worked through in class.

For the pytest, I believe you need to use:

!pytest

in order to get pytest to run.

Let me know if you have any other questions!

@avapapetti
Copy link
Author

Ok, got it!

One other thing. I had previously tried to create a function to remove the CNVs as you had suggested I do too, and I realized when I used '.apply()' that it was converting my dataframe into a series:

Screen Shot 2020-05-05 at 5 46 58 PM

And when I do '.apply(cnv_remover, axis = 1)', I get this error:
Screen Shot 2020-05-05 at 5 49 53 PM

Not sure how to bypass either of these issues.

@leesup
Copy link

leesup commented May 6, 2020

Could you potentially try without axis=1? It might be due to the fact that pandas series is a one-dimensional array!

@avapapetti
Copy link
Author

I think my main problem is I'm not sure how to access the rows where 'Chrom' is 'chrM'. For the code I have above, with file.drop(), I get this error message:

AttributeError: ("'Series' object has no attribute 'Chrom'", 'occurred at index Chrom')

Is there a way for me to specify for it to look at the Series with the Chrom values and then drop each index where chrM is found?

@avapapetti
Copy link
Author

Sorry for all of the questions about this, but do I have to use .apply()? Could I instead do

def cnv_remover(file):
    min_cnv_length = 1000
    p_value_threshold = .90
    
    file = file.drop(file[file.Chrom == 'chrM'].index)
    file = file[(np.abs(file.Start - file.Stop) >= min_cnv_length) & (file.P_Value >= p_value_threshold)]
    
    return file

Then save the new file this way: file1 = cnv_remover(file1)

@emilyyaklich
Copy link

emilyyaklich commented May 6, 2020

Hi Ava,

Yes, that looks good to me!

@avapapetti
Copy link
Author

avapapetti commented May 7, 2020

Great!

I'm having another issue with testing now. I have added an init.py file to the same directory as the module, but I keep getting a ModuleNotFoundError.

This is what I ran: from Package.Common_CNV_Finder import *

I've also updated my tests. Is it OK to test the output of my Common_CNV_Finder function is correct by comparing it to a nested for loop output that is reliable (based on test files with which I know the expected outcome)?

My thought was to do add this at the end of my test: assert_frame_equal(test_common_cnvs, for_loop_common_cnvs)

@leej3
Copy link

leej3 commented May 7, 2020

@avapapetti, I've submitted a pull request to your branch with some suggested changes. Just to clarify:

  • a module is a file ending in .py. Yours were notebooks (ending in ipynb).
  • a function is an isolated bit of code that can take input and output. It starts with def function_name(arg1,arg2): and is followed by the appropriate code. I have moved your code into functions
  • common_cnv_finder seems like a nice package name. I have renamed a directory to make that consistent with this.
  • __init__.py is required (not init.py, it's hard to write that in markdown unfortunately!)

Overall you have a nice solution to your problem. You were just lacking significantly in the packaging/modularization/testing side of things. Have a look through the changes to move forward. Merging them into your branch and working from there might be the easiest.

Moving forward, the most important thing to do is to complete the test template I have written called test_common_csv_finder. That will suffice to get most of the marks for testing. If you have time after that try to make sure you are passing all the tests (and add more if you can).

When completing test_common_csv_finder, think of it as a way to guarantee that if someone pip installs your package, and passes two file paths to common_csv_finder will it produce a result, and is that result correct. Add two small datafilee to tests/data and use them in this call of common_csv_finder

@avapapetti
Copy link
Author

Seems to be getting late for someone to pop into Zoom, so here are my questions:

  1. I've added tests for all of my functions and they all passed! Just wanted to make sure they are legitimate.

  2. I also wanted to double check that I've done all of the steps for packaging.

  3. Is it OK if tests are passing in pytest but not in CircleCI?

  4. If/when I'm ready to submit my final project, do I just push my final changes to my repository?

Thank you.

@leej3
Copy link

leej3 commented May 8, 2020

Sorry, thought you were happily working away.

I've added tests for all of my functions and they all passed! Just wanted to make sure they are legitimate.

yes. well done. Especially using pandas built in assertions. Very nice. In the future looking up pytest "fixtures" for supplying data to tests in a slightly cleaner way. What you have done is fine though.

I also wanted to double check that I've done all of the steps for packaging.

Correct. Your repository is a pip installable python package!

Is it OK if tests are passing in pytest but not in CircleCI?

Not really. Local environments change. People's set ups are different. Having online tests allows several people to have an identical system to confirm their tests are working.

I'm not certain but I think your tests are passing locally because I'm guessing you have a Mac laptop (which has a case insensitive file system). So when tests_dir is defined as "data" it finds it, even though it is named "Data". The online tests are on a linux system that doesn't make that odd logical jump.

Change either in the directory name or the test module and you will pass most of the tests on circleci. You have another test that is failing apart from that. you should see it if you type pytest tests locally.

If/when I'm ready to submit my final project, do I just push my final changes to my repository?

Correct. Your version in your repo at noon tomorrow is what will be graded

@avapapetti
Copy link
Author

That one is on me I realize my comment in the Google Doc was misleading, sorry about that!

There seems to be discrepancies between what's in the Data directory file locally and what's in the Data directory in my repository. When I try to commit changes, I keep getting things like this:
Screen Shot 2020-05-07 at 10 11 26 PM

Also, I'm unsure where I should be running pytest tests.

@avapapetti
Copy link
Author

Update: I figured out pytest tests. It was the example test you had added-- do I need to keep that file or should I delete it?

@leej3
Copy link

leej3 commented May 8, 2020

That's not a problem. .ipynb_checkpoints was added to .gitignore. That tells git to ignore it. But that directory had already been added. So now you are in a weird state where you struggle to add changes that occur in that directory. It shouldn't be in the repository anyway.

Update: I figured out pytest tests. It was the example test you had added-- do I need to keep that file or should I delete it?

You should absolutely delete it. It's in your way to getting that glorious green tick beside your pull request. It's also a demonstration that "passing all of the tests" is not necessarily a good thing.

@avapapetti
Copy link
Author

Oh my goodness there's the green check mark! Yay! The sample files I have inside the Data directory are up to date anyway, so there's nothing more I need to do there right? I really appreciate your prompt replies by the way.

@leej3
Copy link

leej3 commented May 8, 2020

Oh my goodness there's the green check mark! Yay!

Glorious isn't it? Well done. For the record adding DS_store, .ipynb_checkpoints, egg_info, and pycache to .gitignore is a good idea. And any other superfluous files. not a big deal for this project but worth keeping in mind.

But yes, all done.

@avapapetti
Copy link
Author

Thank you! That is good to know, I'll try to do that now. And thanks for all your help and insight!

@avapapetti
Copy link
Author

avapapetti commented May 8, 2020

So sorry it's late but one last thing-- I just tried to pull my repository and I got this:

From https://github.com/avapapetti/project_spring_2020
(*) branch HEAD -> FETCH_HEAD
fatal: refusing to merge unrelated histories

That won't affect anything when you compile my code tomorrow will it?

@leej3
Copy link

leej3 commented May 8, 2020

You are all good. Working tests on circleci are hard to argue with.

I think it may be just an issue with the command you ran. Or you ran it from a different git repository perhaps.

@emilyyaklich
Copy link

Hi Ava,

You did a really great job on this project. Pandas is a great tool when working with datasets and you employed it nicely. Also, the unit tests for your dataframes are great as well. In the future, I think it is good practice to add the .ipynb_checkpoints files to your .gitignore file. Overall, very well done!

Cheers,

Emily

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants