YCombinator-Scraper provides a web scraping tool for extracting data from the Work at a Startup website. The package uses Selenium and BeautifulSoup to navigate pages and extract information.
Documentation: https://nneji123.github.io/ycombinator-scraper
Source Code: https://github.com./nneji123/ycombinator-scraper
Scrape public LinkedIn profile data at scale with Proxycurl APIs.
- Scraping public profiles is battle-tested in court (hiQ v. LinkedIn).
- GDPR, CCPA, SOC2 compliant.
- High rate limit - 300 requests/minute.
- Fast - APIs respond in ~2s.
- Fresh data - 88% of data is scraped in real time; the other 12% is no older than 29 days.
- High accuracy.
- Tons of data points returned per profile.
Built for developers, by developers.
Web Scraping Capabilities:
- Extract detailed information about companies, including name, description, tags, images, job links, and social media links.
- Scrape job-specific details such as title, salary range, tags, and description.
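For example, a job posting can be scraped from Python in much the same way as a company page. A minimal sketch follows; the scrape_job_data method name and the example job URL are assumptions based on the scrape-job CLI command, so check the documentation for the exact API:
from ycombinator_scraper import Scraper

scraper = Scraper()
# Assumed method name mirroring the scrape-job CLI command; the real API may differ.
job_data = scraper.scrape_job_data("https://www.workatastartup.com/jobs/example-job")
# Scraped data is returned as a Pydantic model, so it can be serialized directly.
print(job_data.model_dump_json(by_alias=True, indent=2))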
Founder and Company Data Extraction:
- Obtain information about company founders, including name, image, description, linkedIn profile, and optional email addresses.
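A hedged sketch of pulling founder details through the Python API follows; scrape_founders_data is an assumed method name mirroring the scrape-founders CLI command, and the return value is assumed here to be a list of founder models:
from ycombinator_scraper import Scraper

scraper = Scraper()
# Assumed method name mirroring the scrape-founders CLI command.
founders = scraper.scrape_founders_data("https://www.workatastartup.com/companies/example-inc")
for founder in founders:
    print(founder.model_dump_json(by_alias=True, indent=2))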
Headless Mode:
- Run the scraper in headless mode to perform web scraping without displaying a browser window.
Configurability:
- Easily configure scraper settings such as login credentials, the logs directory, and automatic webdriver installation (via the webdriver-manager package) using environment variables or a configuration file.
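As an illustration of environment-variable configuration, the sketch below sets a few values before constructing the scraper; the variable names are hypothetical placeholders, not the package's documented settings, so consult the documentation for the exact names, including the one that toggles headless mode:
import os

from ycombinator_scraper import Scraper

# Hypothetical setting names, shown only to illustrate configuring the
# scraper through the environment before it starts.
os.environ["YCOMBINATOR_SCRAPER_USERNAME"] = "your-workatastartup-username"
os.environ["YCOMBINATOR_SCRAPER_PASSWORD"] = "your-password"
os.environ["YCOMBINATOR_SCRAPER_HEADLESS"] = "true"  # run without a browser window
os.environ["YCOMBINATOR_SCRAPER_LOGS_DIR"] = "./logs"

scraper = Scraper()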
Command-Line Interface (CLI):
- Command-line tools to perform various scraping tasks interactively or in batch mode.
Data Output Formats:
- Save scraped data in JSON or CSV format, providing flexibility for further analysis or integration with other tools.
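Because the scraped objects are Pydantic models, you can also flatten them into CSV yourself with pandas. This is a minimal sketch using the standard model_dump() and pandas APIs; the package's built-in CSV export may expose its own options:
import pandas as pd

from ycombinator_scraper import Scraper

scraper = Scraper()
company_data = scraper.scrape_company_data("https://www.workatastartup.com/companies/example-inc")

# model_dump() converts the Pydantic model to a plain dict,
# which pandas can write out as a single-row CSV file.
df = pd.DataFrame([company_data.model_dump(by_alias=True)])
df.to_csv("company_data.csv", index=False)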
Caching Mechanism:
- Implement a caching feature to store function results for a specified duration, reducing redundant web requests and improving performance.
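The idea is the same as a small time-to-live cache. The sketch below illustrates the general pattern only and is not the package's actual caching implementation or interface:
import time
from functools import wraps

def ttl_cache(seconds: float):
    """Cache a function's results for `seconds` seconds (illustration only)."""
    def decorator(func):
        store = {}
        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            if args in store and now - store[args][0] < seconds:
                return store[args][1]      # reuse a recent result
            result = func(*args)
            store[args] = (now, result)    # refresh the cached entry
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=3600)
def fetch_page(url: str) -> str:
    ...  # the expensive web request would go here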
Docker Support:
- Package the scraper as a Docker image for easy deployment and execution in containerized environments, or pull the prebuilt image with docker pull nneji123/ycombinator_scraper.
- Python 3.9+
- Chrome or Chromium browser installed.
$ pip install ycombinator-scraper
$ ycscraper --help
# Output
YCombinator-Scraper Version 0.7.0
Usage: python -m ycombinator_scraper [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
login
scrape-company
scrape-founders
scrape-job
version
$ git clone https://github.com./nneji123/ycombinator-scraper
$ cd ycombinator-scraper
$ docker build -t your_name/scraper_name . # e.g. docker build -t nneji123/ycombinator_scraper .
$ docker run nneji123/ycombinator_scraper python -m ycombinator_scraper --help
- click: Enables the creation of a command-line interface for interacting with the scraper tool.
- beautifulsoup4: Facilitates the parsing and extraction of data from HTML and XML in the web scraping process.
- loguru: Provides a robust logging framework to track and manage log messages generated during the scraping process.
- pandas: Utilized for the manipulation and organization of data, particularly in generating CSV files from scraped information.
- pathlib: Offers an object-oriented approach to handle file system paths, contributing to better file management within the project.
- pydantic: Used for data validation and structuring the models that represent various aspects of scraped data.
- pydantic-settings: Extends Pydantic to enhance the management of settings in the project.
- selenium: Employs browser automation for web scraping, allowing interaction with dynamic web pages and extraction of information.
ycscraper scrape-company --company-url https://www.workatastartup.com/companies/example-inc
This command will scrape data for the specified company and save it in the default output format (JSON).
from ycombinator_scraper import Scraper
scraper = Scraper()
company_data = scraper.scrape_company_data("https://www.workatastartup.com/companies/example-inc")
print(company_data.model_dump_json(by_alias=True, indent=2))
Pydantic is used under the hood, so methods like model_dump_json are available for all the scraped data.
You can view more examples here: Examples
We welcome contributions from the community! To contribute to this project, follow the steps below.
You can use Gitpod, a free online VS Code-like environment, to quickly start contributing.
- Clone the repository:
git clone https://github.com./nneji123/ycombinator-scraper.git
cd ycombinator-scraper
- Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Make sure to run tests before submitting a pull request.
pip install -r requirements-test.txt
pytest tests
If you make changes to documentation, install the necessary dependencies:
pip install -r requirements-docs.txt
mkdocs serve
We use pre-commit to ensure code quality. Install it by running:
pip install pre-commit
pre-commit install
Now, pre-commit will run automatically before each commit to check for linting and other issues.
- Fork the repository and create a new branch for your contribution:
git checkout -b feature-or-fix-branch
- Make your changes and commit them:
git add .
git commit -am "Your meaningful commit message"
- Push the changes to your fork:
git push origin feature-or-fix-branch
- Open a pull request on GitHub. Provide a clear title and description of your changes.
The documentation is made with Material for MkDocs and is hosted by GitHub Pages.
YCombinator-Scraper is distributed under the terms of the MIT license.