Extract allows users to dive into specifics on a site and beyond. With this API we allow developers to extract article text, topics, and retrieve more meta-data about articles, blog posts, and stories.

Response Attributes

original_url
The url that was passed into Embedly. This will be something like a bit.ly shortened link or if there is no redirect it will be the same as the url attribute.

url
The effective url of the request. Whatever Embedly found at the end of any
redirects.

type
Returns the type of the document at this URL, they can be one of the following:

  • html: The most common response. The resource is an html document.
  • text: The response is a plain text document.
  • image: This is a static viewable image.
  • video: This is a playable video.
  • audio: This is a playable audio.
  • rss: The resource is an rss feed.
  • xml: The resource is an xml document.
  • atom: The resource is an atom feed.
  • json: The resource is a json document.
  • ppt: The resource is a PowerPoint document.
  • link: This is a general embed that may not contain HTML.
  • error: When accessing multiple urls at once Embedly will not throw HTTP errors as normal. Instead, it will return an error type response that includes the url, error_message and error_code.

cache_age
How long Embedly is going to cache the response for? Generally, this is for a day, unless some external factor tells us to reevaluate the resource.

safe
Safe is an attribute that tells you if the url is on a phishing or malware list. Embedly uses Google's
Safe Browsing API to obtain a list of malicious urls. By using this API through our service, you agree to itsterms of service.

provider_name
The name of the resource provider.

provider_url
The url of the resource provider.

provider_display
For display purposes we include provider_display, it's the subdomain, hostname, and public suffix of the provider.

favicon_url
The url of the favicon.

favicon_colors
List of dominant colors extracted from favicon_url.

title
The title of the resource. It's picked in the following order:

  • The rss entry's title
  • The oEmbed title
  • The open graph title
  • The meta title tag
  • The title attribute in the head element

authors
A list of all the authors that are associated with this article. Each author has a url and name. Here is an example response:

    [{
      "name": "Sean Creeley"
      "url": "http://blog.embed.ly/screeley"
    }]

Most articles have only one author, but authors makes it flexible enough to add more if necessary.

published
A representation of the date which the article was published in milliseconds. If an offset is present, then there was timezone data present, otherwise we assume the Date is in UTC. Like all dates, this is a little confusing, so we will explain. Say the Embedly parser came across the following HTML:

<span class="pubdate">Aug 24, 2012</span>

Because there is no timezone information, Embedly will not return an offset and the published attribute will be in UTC. We will return the following response:

"published": 1345766400000

offset
The UTC offset of the date in milliseconds. See the above section for more information about offset and how to use it with the published time.

description
The description of the resource. It's picked in the following order:

  • The oEmbed description
  • The open graph description
  • The meta description tag
  • An excerpt pulled programmaticly by Embedly

lead
Often there is a lead paragraph that is a brief summary of the rest of the
article. Embedly tries to pull this lead paragraph out for a better reading
experience. It is always a p tag, i.e.::

"lead": "<p>This is a summary of the below article</p>"

content
This is the html that we pulled from the URL. It's been sanitized, so it will only contain the following tags::

'a', 'abbr', 'acronym', 'b', 'big', 'blockquote', 'br', 'cite', 'code',
'del', 'dfn', 'em', 'i', 'ins', 'kbd', 'mark', 'pre', 'q', 's', 'samp',
'small', 'span', 'strike', 'strong', 'sub', 'sup', 'time', 'tt', 'u',
'var', 'p', 'div', 'a', 'h2', 'h3', 'h4', 'h5', 'h6', 'img', 'ol', 'ul',
'li'

All tag attributes have been removed as well. The only effective attributes are:

  • href on an a tag
  • src on an img tag

keywords

Returns the meta keywords from the page.

The keywords object gives you a list of ranked keywords extracted from the article or blog text of a URL. This is different from the meta keywords defined by the page. Embedly uses its own ranking system to determine which keywords are the most relevant. Each keyword has two values:

``name``
    Name of the keyword.

``score`` (DEPRECATED)
    Returns 0 for no specific relevance.
    Score of the keyword from the article, larger numbers infer more relevance.

language

Returns the language code for the page. Typically ISO639-1 standard, i.e. en, fr, etc.

entities (DEPRECATED)

Returns an empty list.

The entities object gives you a list of proper nouns (persons, places, and things) extracted from the article or blog text of a URL. Along with each entity name, Embedly counts the number of times it appears in the text. Each entity has two values:

``name``
    Name of the entity (person, place, or thing).

``count``
    Number of times an entity appears in the text.

images
The images object is a list of, at most, 5 images that Embedly found while processing the URL. Along with the list of images we pull various pieces of information about each image including dimensions, dominant colors, entropy, captions, and size. Each image contains:

``width``: The width of the image in pixels (required)
``height``: The height of the image in pixels (required)
``size``: The size in bytes of the image.
``caption``: An image  caption is the short description or sentence that many sites provide below each image. This text usually describes what the image is depicting.
``entropy``: Image entropy can be roughly thought of as how "busy" an image is. Image entropy can be useful in programmatically choosing the type of image to display. For instance, if an API user wants to display photographic type images, but not logos, they can ignore images with lower entropy.
``colors``: The dominant colors of an image are those colors that make up the majority of
an image. The ``color`` attribute is the RGB representation of the color and and the ``wight`` is how prevalent that color is in the image.

app_links
There are few different specs for deep linking, they include App Links, Twitter's App Card and Google's App Indexing. Embedly parses and returns all three in a list of objects. Each is dependent on the spec, but two attributes are the same for each:

``namespace``: The type of the deep link. ``ai``, ``twitter`` or ``google 
``type``: The type of the device the deep link targets ``ipad``, ``iphone``, ``andriod``, ``ios`` etc.

media
The media is primary type of content (video, photo, etc.) that is associated with a url. It follows the general pattern of the /1/oembed Response, but with only a limited set of attributes. Note: It is optional and only available if we can classify it as such type.

type
The resource type. Valid values, along with value-specific parameters, are described below.

The photo type

This type is used for representing static photos. The following parameters are
defined:

url
The source URL of the image. Consumers should be able to insert this URL
into an<img>element. Only HTTP and HTTPS URLs are valid.

width
The width in pixels of the image specified in the url parameter.

height
The height in pixels of the image specified in the url parameter.

The video type

This type is used for representing playable videos. The following parameters
are defined:

html
The HTML required to embed a video player. The HTML should have no padding
or margins. Consumers may wish to load the HTML in an off-domain iframe to
avoid XSS vulnerabilities.

width
The width in pixels required to display the HTML. If not supplied
the HTML returned will expand horizontally to the size of its parent
container.

height
The height in pixels required to display the HTML. If not supplied
the HTML returned will expand vertically to the size of its parent
container.

The rich type

This type is used for rich HTML content that does not fall under one of the
other categories. The following parameters are defined:

html (required)
The HTML required to display the resource. The HTML should have no padding
or margins. Consumers may wish to load the HTML in an off-domain iframe to
avoid XSS vulnerabilities. The markup should be valid XHTML 1.0 Basic.

width
The width in pixels required to display the HTML. If not supplied
the HTML returned will expand horizontally to the size of its parent
container.

height
The height in pixels required to display the HTML. If not supplied the HTML returned will expand vertically to the size of its parent container.

Optional media fields

age_restriction
A string value for age restrictions; valid values include variations of 18+, 13+, R, PG. Only present if populated.

duration
A float value for duration of video, audio segment in seconds. Only present if populated.

Error Codes

JSON Requests

400 Bad Request

  • Required "url" parameter is missing.
  • Either "url" or "urls" parameter is reqiured.
  • Invalid URL format.
  • Invalid "maxheight" parameter.
  • Invalid "maxwidth" parameter.
  • Invalid "urls" parameter, exceeded max count of 20.

401 Unauthorized

  • Invalid key or oauth_consumer_key provided: , contact: [email protected].
  • The provided key does not support this endpoint: , contact: [email protected].
  • URL is private or restricted.

403 Forbidden

404 Not Found
URL Not Found or unable to be accessed by Embedly servers.

Note: Error message will contain Upstream server response.

500 Server issues
Embed.ly is having trouble with this url. Please try again or contact us, [email protected].

501 Not Implemented
Not implemented for format: acceptable values are {json}.

503 Service Unavailable
Note: This happens if our service is down, please contact us immediately: [email protected].

JSONP Requests

Format

``callbackFunction({"url": "url with error", "error_code": "error code",
"error_message": "error message", "type": "error"})``

Error Response

``jsonp1273162787542({"url": "http://flickr.com/embedly", "error_code": 404, "error_message":
"HTTP 404: Not Found", "type": "error"})``

👍

Want to try out the Extract API?

Try out our handy Extract API Explorer. It's awesome and will allow you to preview the response of any URL.

Go to the Extract API Explorer

Language
Authorization
Query