90% accuracy claim. #1

Open

varadhbhatnagar opened this issue Sep 16, 2020 · 6 comments

Comments

varadhbhatnagar commented Sep 16, 2020

It is mentioned on the website (https://codist-ai.com/) that this model gives 90% accuracy.
Can you elaborate on what exactly this accuracy is and how it is measured?

rcshubhadeep (Contributor) commented Sep 16, 2020

Hello @varadhbhatnagar, we pre-trained on a corpus of over 6M lines of logical Python code (we injected special tokens such as (indent), (dedent), etc. to preserve the logical structure of the code). We then fine-tuned the model on a binary classification problem: the model is shown a pair of token sequences, the first drawn from the code and the second from the comments, and the task is to predict whether they match. We fine-tuned for this task on about 35K pairs. On that task, the training F1 score reaches 90%.

Hope this answers your question.
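
For readers who want a concrete picture of this setup, here is a minimal sketch of pair-wise code/docstring classification with a BERT-style encoder. The checkpoint name and the Hugging Face transformers API used here are illustrative assumptions, not the project's released model or training code.

```python
# Minimal sketch (assumed setup, not the project's code): a BERT-style
# encoder classifies a (code, docstring) pair as match / no-match.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; the project's actual pre-trained weights differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # two labels: no-match (0) / match (1)
)

code = "def add(a, b):\n    return a + b"
docstring = "Add two numbers and return the result."

# Encode the pair as one sequence: [CLS] code [SEP] docstring [SEP]
inputs = tokenizer(code, docstring, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(f"P(match) = {torch.softmax(logits, dim=-1)[0, 1].item():.3f}")
```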

varadhbhatnagar (Author) commented Sep 16, 2020

Thanks. Is there a paper associated with this project? And is it related to Microsoft's CodeBERT in any way?

rcshubhadeep (Contributor)

It is not related to MS CodeBERT (apart from sharing a similar name). The methodology we followed is inspired by the CuBERT paper, with our own methods and ideas blended into it. We have not published a paper yet, but the model is open-sourced for everyone to use.

rcshubhadeep (Contributor)

Thanks for asking the questions 👍

varadhbhatnagar (Author)

I wanted to get an idea of the method complexity this model can handle. For training and testing, did you use simple methods similar to the files in the test_files directory?

rcshubhadeep (Contributor)

Hi,

We fine-tuned this model on the task using the py150k dataset.

To clarify: that dataset contains 150K Python files. We used our open-source library tree-hugger to mine those files and build a dataset of (method, docstring) tuples. We then swapped about 50% of the docstrings and marked those pairs as the negative class, while the rest are positive, and used the pre-trained model for fine-tuning on this task. A sketch of that swapping step follows below.
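
A minimal sketch of that negative-sampling step, assuming the mined pairs are already in memory as (method_source, docstring) tuples; the function name and structure here are hypothetical, not the project's actual pipeline:

```python
# Hypothetical sketch of the described negative sampling: swap ~50% of
# docstrings between methods to create mismatched (negative) pairs.
import random

def build_pairs(examples, neg_fraction=0.5, seed=42):
    """examples: list of (method_source, docstring) tuples mined from py150k."""
    rng = random.Random(seed)
    n = len(examples)
    neg_idx = set(rng.sample(range(n), int(n * neg_fraction)))
    docstrings = [d for _, d in examples]
    dataset = []
    for i, (code, doc) in enumerate(examples):
        if i in neg_idx:
            # Pair the method with a docstring from a different, random method.
            j = rng.randrange(n)
            while j == i:
                j = rng.randrange(n)
            dataset.append((code, docstrings[j], 0))  # 0 = mismatch
        else:
            dataset.append((code, doc, 1))            # 1 = match
    return dataset
```

Swapping docstrings between real methods is a cheap way to get plausible negatives: the swapped docstring is genuine prose about code, just not about this method.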
