Comparing read_csv with spark_read_csv

Reading a CSV file into R with readr’s `read_csv()` function is simple. The syntax and parameters are fairly easy to remember once you’ve done it a few times.

read_csv(
  file,
  col_names = TRUE,
  col_types = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = show_progress()
)
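As a quick illustration of the signature above, here is a minimal sketch of a `read_csv()` call. The file contents, the column names (`id`, `name`, `score`) and the object name `scores` are made up for the example; only a few of the parameters are set, and the rest keep their defaults.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
# (Hypothetical data, not from the original post.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ann,9.5", "2,Bob,NA", "3,,7"), csv_path)

# Pin the column types explicitly instead of relying on type guessing.
scores <- read_csv(
  csv_path,
  col_names = TRUE,
  col_types = cols(
    id    = col_integer(),
    name  = col_character(),
    score = col_double()
  ),
  na = c("", "NA")  # both empty fields and the string "NA" become missing
)

print(scores)
```

Setting `col_types` is optional; leaving it as `NULL` lets readr guess the types from the first `guess_max` rows.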

I’ve only just started working with big data sets, and began wondering whether what I know about the readr syntax carries over to sparklyr’s spark_read_csv() function.

The two are not exactly the same, but if you know one, you can quite easily pick up the other. There’s an additional parameter `sc`, aka the Spark connection, that’s required.

spark_read_csv(
  sc,
  name,
  path,
  header = TRUE,       # FALSE forces a "V_" prefix
  columns = NULL,
  infer_schema = TRUE, # to infer column data type
  delimiter = ",",
  quote = "\"",
  escape = "\\",
  charset = "UTF-8",
  null_value = NULL,
  options = list(),
  repartition = 0,     # number of partitions to distribute the generated table
  memory = TRUE,
  overwrite = TRUE,
  ...
)
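To see how closely the call mirrors `read_csv()`, here is a minimal sketch of `spark_read_csv()` in use. It assumes a local Spark installation is available (for example one set up with sparklyr’s `spark_install()`); the file contents and the table name `scores` are made up for the example.

```r
library(sparklyr)

# Assumption: Spark is installed locally and a local master can be started.
sc <- spark_connect(master = "local")

# Write a tiny CSV to a temporary file (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ann,9.5", "2,Bob,8.0", "3,Cy,7.0"), csv_path)

# sc and name come first; the remaining arguments play roles similar
# to read_csv()'s col_names (header) and type guessing (infer_schema).
scores_tbl <- spark_read_csv(
  sc,
  name = "scores",
  path = csv_path,
  header = TRUE,
  infer_schema = TRUE,
  delimiter = ",",
  memory = TRUE,
  overwrite = TRUE
)

# Count the rows of the Spark table before closing the connection.
n_rows <- sdf_nrow(scores_tbl)
print(n_rows)

spark_disconnect(sc)
```

Unlike `read_csv()`, the result is a reference to a table that lives in Spark rather than a data frame in R’s memory, which is why the row count is computed with `sdf_nrow()` while the connection is still open.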
